Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
Chapters
0:00 Introduction
2:23 AlexNet paper and the ImageNet moment
8:33 Cost functions
13:39 Recurrent neural networks
16:19 Key ideas that led to success of deep learning
19:57 What's harder to solve: language or vision?
29:35 We're massively underestimating deep learning
36:04 Deep double descent
41:20 Backpropagation
42:42 Can neural networks be made to reason?
50:35 Long-term memory
56:37 Language models
60:35 GPT-2
67:14 Active learning
68:52 Staged release of AI systems
73:41 How to build AGI?
85:00 Question to AGI
92:07 Meaning of life
00:00:00.000 |
The following is a conversation with Ilya Sutskever, 00:00:06.120 |
one of the most cited computer scientists in history 00:00:13.480 |
And to me, one of the most brilliant and insightful minds 00:00:21.680 |
who I would rather talk to and brainstorm with 00:00:24.040 |
about deep learning, intelligence, and life in general 00:00:37.200 |
For everyone feeling the medical, psychological, 00:00:43.160 |
Stay strong, we're in this together, we'll beat this thing. 00:00:54.040 |
support it on Patreon, or simply connect with me on Twitter 00:01:18.860 |
Cash App lets you send money to friends, buy Bitcoin, 00:01:22.060 |
invest in the stock market with as little as $1. 00:01:29.280 |
in the context of the history of money is fascinating. 00:01:47.160 |
and Bitcoin, the first decentralized cryptocurrency, 00:01:53.480 |
cryptocurrency is still very much in its early days 00:01:58.200 |
and just might, redefine the nature of money. 00:02:03.520 |
from the App Store or Google Play and use the code LEXPODCAST, 00:02:08.000 |
you get $10, and Cash App will also donate $10 to FIRST, 00:02:12.440 |
an organization that is helping advance robotics 00:02:14.820 |
and STEM education for young people around the world. 00:02:17.620 |
And now, here's my conversation with Ilya Sutskever. 00:02:39.560 |
what was your intuition about neural networks, 00:02:42.240 |
about the representational power of neural networks? 00:03:06.700 |
At some point, we realized that we can train very large, 00:03:18.540 |
At some point, different people obtained this result. 00:03:36.300 |
end to end, without pre-training, from scratch. 00:03:40.620 |
And when that happened, I thought, this is it. 00:03:43.940 |
Because if you can train a big neural network, 00:03:45.620 |
a big neural network can represent a very complicated function. 00:03:49.500 |
Because if you have a neural network with 10 layers, 00:04:13.100 |
that we need to train a very big neural network 00:04:16.100 |
on lots of supervised data, and then it must succeed, 00:04:25.760 |
Today, we know that actually this theory is very incomplete 00:04:30.380 |
But definitely, if you have more data than parameters, 00:04:34.700 |
were heavily over-parameterized wasn't discouraging to you? 00:04:43.080 |
the fact there's a huge number of parameters is okay? 00:04:48.260 |
but the theory was that if you had a big dataset 00:04:53.060 |
The over-parameterization just didn't really figure much 00:05:04.460 |
will we have enough compute to train a big enough neural net? 00:05:23.460 |
- Was most of your intuition from empirical results 00:05:34.700 |
Or was there some pen and paper or marker and whiteboard 00:05:40.760 |
'Cause you just connected a 10-layer large neural network 00:05:44.740 |
to the brain, so you just mentioned the brain. 00:05:49.220 |
does the human brain come into play as an intuition builder? 00:05:55.020 |
I mean, you gotta be precise with these analogies 00:05:57.520 |
between artificial neural networks and the brain. 00:06:00.300 |
But there's no question that the brain is a huge source 00:06:04.100 |
of intuition and inspiration for deep learning researchers 00:06:07.460 |
since all the way from Rosenblatt in the '60s. 00:06:10.820 |
Like, if you look at, the whole idea of a neural network 00:06:15.740 |
You had people like McCulloch and Pitts who were saying, 00:06:21.980 |
"And hey, we recently learned about the computer 00:06:36.020 |
Then you had the convolutional neural network 00:06:39.940 |
who said, "Hey, if you limit the receptive fields 00:06:42.020 |
"of a neural network, it's gonna be especially suitable 00:06:49.980 |
where analogies to the brain were successful. 00:06:52.380 |
And I thought, well, probably an artificial neuron 00:07:12.100 |
between the human brain, now I know you're probably 00:07:19.780 |
what's the difference between the human brain 00:07:21.260 |
and artificial neural networks that's interesting to you 00:07:27.380 |
What is an interesting difference between the brain 00:07:32.940 |
So I feel like today, artificial neural networks, 00:07:37.140 |
so we all agree that there are certain dimensions 00:07:39.420 |
in which the human brain vastly outperforms our models. 00:07:46.220 |
have a number of very important advantages over the brain. 00:07:50.200 |
Looking at the advantages versus disadvantages 00:07:52.580 |
is a good way to figure out what is the important difference. 00:07:55.640 |
So the brain uses spikes, which may or may not be important. 00:08:08.380 |
- It's hard to tell, but my prior is not very high 00:08:16.500 |
what they figured out is that they need to simulate 00:08:24.300 |
If you don't simulate the non-spiking neural networks 00:08:29.540 |
And that connects to questions around back propagation 00:08:40.460 |
It's not a self-evident question, especially if you, 00:08:45.860 |
let's say if you were just starting in the field 00:08:53.740 |
That's a great idea because the brain is a neural network, 00:08:55.900 |
so it would be useful to build neural networks. 00:09:00.420 |
It should be possible to train them probably, but how? 00:09:08.780 |
The cost function is a way of measuring the performance 00:09:14.920 |
By the way, that is a big, actually, let me think. 00:09:28.900 |
Is supervised learning a difficult concept to come to? 00:09:36.460 |
- Yeah, that's what, it seems trivial now, but I, 00:09:38.940 |
'cause the reason I ask that, and we'll talk about it, 00:09:43.460 |
Is there things that don't necessarily have a cost function, 00:09:50.900 |
or maybe a totally different kind of architectures? 00:09:57.980 |
- So the only, so the good examples of things 00:09:59.940 |
which don't have clear cost functions are GANs. 00:10:09.260 |
where you know that you have an algorithm gradient descent, 00:10:13.940 |
and then you can reason about the behavior of your system 00:10:20.020 |
"and I'll reason about the behavior of the system 00:10:24.540 |
But it's all about coming up with these mathematical objects 00:10:26.540 |
that help us reason about the behavior of our system. 00:10:31.180 |
Yeah, so GAN is the only one, it's kind of a, 00:10:33.420 |
the cost function is emergent from the comparison. 00:10:36.900 |
- It's, I don't know if it has a cost function. 00:10:41.360 |
It's kind of like the cost function of biological evolution 00:10:49.460 |
to which it will go towards, but I don't think, 00:10:53.780 |
I don't think the cost function analogy is the most useful. 00:10:57.500 |
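To make the GAN point concrete: a rough sketch of the usual GAN training loop, where there is no single scalar cost that both networks descend; the generator and discriminator optimize coupled, opposing objectives. This is a minimal toy on made-up 1-D data, purely illustrative.

```python
import torch

# Toy 1-D GAN: two coupled objectives trained against each other,
# rather than one cost function that everything minimizes.
G = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
D = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0            # "data": samples from N(2, 0.5)
    fake = G(torch.randn(64, 8))                     # generator's samples

    # discriminator: tell real from fake
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator: make fakes look real to the current discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The "cost" here is emergent from the competition: each network's loss depends on the other's current parameters.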
- So if evolution doesn't, that's really interesting. 00:11:00.140 |
So if evolution doesn't really have a cost function, 00:11:04.940 |
something akin to our mathematical conception 00:11:09.900 |
of a cost function, then do you think cost functions 00:11:15.180 |
Yeah, so you just kind of mentioned that cost function 00:11:26.780 |
So self-play starts to touch on that a little bit 00:11:50.380 |
yet another profound way of looking at things 00:11:52.700 |
that will involve cost functions in a less central way. 00:11:55.620 |
But I don't know, I think cost functions are, I mean, 00:12:05.500 |
that pop into your mind that might be different 00:12:16.220 |
- I mean, one thing which may potentially be useful, 00:12:18.620 |
I think people, neuroscientists have figured out 00:12:20.580 |
something about the learning rule of the brain, 00:12:22.220 |
or I'm talking about spike-timing-dependent plasticity, 00:12:28.420 |
- Wait, sorry, spike-timing-dependent plasticity? 00:12:39.660 |
So it's kind of like, if a synapse fires into the neuron 00:12:42.580 |
before the neuron fires, then it strengthens the synapse. 00:12:48.020 |
shortly after the neuron fired, then it weakens the synapse. 00:12:52.220 |
I'm 90% sure it's right, so if I said something wrong here, 00:13:01.060 |
But the timing, that's one thing that's missing. 00:13:07.460 |
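For reference, a minimal sketch of the rule being described, in the textbook form of spike-timing-dependent plasticity with illustrative constants:

```python
import numpy as np

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Illustrative STDP rule (textbook form, constants made up).

    If the presynaptic spike arrives shortly *before* the postsynaptic spike
    (dt > 0), the synapse is strengthened; if it arrives shortly *after*
    (dt < 0), it is weakened. The effect decays exponentially with the gap.
    """
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * np.exp(-dt / tau)    # pre before post -> potentiation
    return -a_minus * np.exp(dt / tau)       # pre after post -> depression

print(stdp_delta_w(t_pre=10.0, t_post=15.0))   # small positive weight change
print(stdp_delta_w(t_pre=15.0, t_post=10.0))   # small negative weight change
```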
I think that's like a fundamental property of the brain, 00:13:22.300 |
There's a clock, I guess, to recurrent neural networks. 00:13:30.100 |
the continuous version of that, the generalization, 00:13:36.060 |
and then within those timings is contained some information. 00:13:48.860 |
as the timing that seems to be important for the brain, 00:13:56.300 |
- I mean, I think recurrent neural networks are amazing, 00:14:00.700 |
and they can do, I think they can do anything 00:14:21.320 |
that we'll talk about on natural language processing 00:14:24.420 |
and language modeling has been with transformers 00:14:29.980 |
Do you think recurrence will make a comeback? 00:14:33.260 |
- Well, some kind of recurrence, I think, very likely. 00:14:38.700 |
as they're typically thought of for processing sequences, 00:14:44.420 |
- What is, to you, a recurrent neural network? 00:15:05.660 |
that's what, like, expert systems did, right? 00:15:12.380 |
growing a knowledge base is maintaining a hidden state, 00:15:18.460 |
and is growing it by sequentially processing. 00:15:20.300 |
Do you think of it more generally in that way? 00:15:22.700 |
Or is it simply, is it the more constrained form 00:15:27.700 |
of a hidden state with certain kind of gating units 00:15:31.340 |
that we think of as today with LSTMs and that? 00:15:37.860 |
that goes inside the LSTM or the RNN or something like this. 00:15:43.260 |
if you want to make the expert system analogy, I'm not, 00:16:19.340 |
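A minimal sketch of what "maintaining a hidden state and growing it by sequential processing" looks like in code, for a vanilla RNN cell (an LSTM adds gating on top of the same idea; sizes here are arbitrary):

```python
import numpy as np

# The hidden state is the running "knowledge" that gets updated
# as each new input in the sequence is processed.
rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_step(h, x):
    # the new state mixes the old state with the new observation
    return np.tanh(W_hh @ h + W_xh @ x + b)

h = np.zeros(d_hidden)
for x in rng.normal(size=(10, d_in)):   # a sequence of 10 inputs
    h = rnn_step(h, x)                  # the state accumulates information
print(h)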
Neural networks have been around for many decades, 00:16:24.220 |
that led to their success, that ImageNet moment, 00:16:35.540 |
the key ideas that led to the success of deep learning 00:16:42.900 |
behind deep learning has been around for much longer. 00:16:53.940 |
before deep learning started to be successful, 00:17:02.900 |
simply didn't think that neural networks could do much. 00:17:06.300 |
People didn't believe that large neural networks 00:17:10.580 |
People thought that, well, there was a lot of debate 00:17:34.060 |
That's when this field becomes a little bit more 00:17:43.540 |
The thing that was missing was a lot of supervised data 00:17:47.940 |
Once you have a lot of supervised data and a lot of compute, 00:17:52.620 |
then there is a third thing which is needed as well, 00:18:25.180 |
allowed the empirical evidence to do the convincing 00:18:29.660 |
of the majority of the computer science community. 00:18:48.260 |
and I think ImageNet served as that moment. 00:18:52.940 |
where the big pillars of computer vision community 00:19:01.500 |
And it's not enough for the ideas to all be there 00:19:06.300 |
it also has to overcome the cynicism that existed. 00:19:10.460 |
It's interesting that people just didn't believe 00:19:27.540 |
because neural networks really did not work on anything. 00:19:37.980 |
And that's why you need to have these very hard tasks 00:19:46.940 |
And that's why the field is making progress today 00:19:52.780 |
And this is why we are able to avoid endless debate. 00:20:03.060 |
in computer vision, language, natural language processing, 00:20:07.060 |
reinforcement learning, sort of everything in between. 00:20:12.540 |
There may not be a topic you haven't touched. 00:20:16.260 |
And of course, the fundamental science of deep learning. 00:20:19.660 |
What is the difference to you between vision, language, 00:20:39.660 |
Machine learning is a field with a lot of unity, 00:20:50.180 |
In fact, there's only one or two or three principles 00:20:57.380 |
in almost the same way to the different modalities 00:21:01.380 |
And that's why today, when someone writes a paper 00:21:04.140 |
on improving optimization of deep learning and vision, 00:21:12.340 |
Reinforcement learning, so I would say that computer vision 00:21:23.900 |
and we use convolutional neural networks in vision. 00:21:26.500 |
But it's also possible that one day this will change 00:21:28.900 |
and everything will be unified with a single architecture. 00:21:39.340 |
for every different tiny problem had its own architecture. 00:21:55.900 |
and sub, you know, little set of collection of skills, 00:21:58.620 |
people who would know how to engineer the features. 00:22:08.500 |
Or rather, I shouldn't say expect, I think it's possible. 00:22:16.820 |
RL does require slightly different techniques 00:22:20.780 |
You really do need to do something about exploration. 00:22:26.020 |
But I think there is a lot of unity even there. 00:22:31.140 |
broader unification between RL and supervised learning, 00:22:35.220 |
where somehow the RL will be making decisions 00:22:43.260 |
you shovel things into it and it just figures out 00:22:48.020 |
- I mean, reinforcement learning has some aspects 00:22:57.740 |
that you should be utilizing and there's elements 00:23:03.060 |
So it seems like the, it's like the union of the two 00:23:09.980 |
I'd say that reinforcement learning is neither, 00:23:17.360 |
- You think action is fundamentally different? 00:23:21.300 |
what is unique about policy of learning to act? 00:23:29.800 |
you are fundamentally in a non-stationary world 00:23:41.300 |
And this is not the case for the more traditional 00:23:44.140 |
static problem where you have some distribution 00:23:46.300 |
and you just apply a model to that distribution. 00:23:48.600 |
- You think it's a fundamentally different problem 00:23:53.900 |
it's a generalization of the problem of understanding? 00:23:56.980 |
- I mean, it's a question of definitions almost. 00:24:00.600 |
there's a huge amount of commonality for sure. 00:24:01.940 |
You take gradients, you try, you take gradients, 00:24:04.100 |
we try to approximate gradients in both cases. 00:24:06.100 |
In some, in the case of reinforcement learning, 00:24:16.260 |
You compute the gradient, you apply Adam in both cases. 00:24:28.900 |
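A rough sketch of that shared machinery: a supervised step and a REINFORCE-style reinforcement learning step both end in the same backprop-plus-Adam update; they differ only in where the learning signal comes from. Sizes and data below are made up for illustration.

```python
import torch

policy = torch.nn.Linear(4, 3)            # toy model/policy: 4-dim observation -> 3 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Supervised step: the target is given, so the loss gradient is exact.
obs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
loss = torch.nn.functional.cross_entropy(policy(obs), labels)
opt.zero_grad(); loss.backward(); opt.step()

# RL step (REINFORCE): no targets, only sampled actions and rewards,
# so the gradient of expected reward is only *estimated*.
obs = torch.randn(8, 4)
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
rewards = torch.randn(8)                  # stand-in for environment feedback
loss = -(dist.log_prob(actions) * rewards).mean()
opt.zero_grad(); loss.backward(); opt.step()   # same update machinery either way
```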
It's really just a matter of your point of view, 00:24:48.060 |
is harder than visual scene understanding or vice versa? 00:24:59.460 |
- So what does it mean for a problem to be hard? 00:25:02.620 |
Okay, the non-interesting, dumb answer to that 00:25:10.700 |
and there's a human level performance on that benchmark. 00:25:20.620 |
until we get to human level on a very good benchmark. 00:25:28.900 |
So what I was going to say that a lot of it depends on, 00:25:32.060 |
you know, once you solve a problem, it stops being hard. 00:25:39.740 |
So, you know, you say today, true human level, 00:25:48.900 |
of solving the problem completely in the next three months. 00:25:55.420 |
my guess would be as good as yours, I don't know. 00:25:57.700 |
- Okay, so you don't have a fundamental intuition 00:26:04.300 |
I'd say language is probably going to be harder. 00:26:11.220 |
100% language understanding, I'll go with language. 00:26:17.980 |
with letters on it, is that, you see what I mean? 00:26:22.620 |
you say it's the best human level vision system. 00:26:25.100 |
I show you, I open a book and I show you letters. 00:26:36.100 |
- Yeah, so Chomsky would say it starts at language. 00:26:39.860 |
of the kind of structure and fundamental hierarchy 00:26:44.860 |
of ideas that's already represented in our brain somehow 00:26:51.380 |
But where does vision stop and language begin? 00:27:15.580 |
without basically using the same kind of system. 00:27:25.380 |
is probably that good that we can get the other. 00:27:30.180 |
And also, I think a lot of it really does depend 00:27:40.020 |
Because reading is vision, but should it count? 00:28:09.740 |
- Well, the ones, okay, so I'm a fan of monogamy, 00:28:20.020 |
it's possible to have somebody continuously giving you 00:28:22.980 |
pleasurable, interesting, witty, new ideas, friends. 00:28:32.100 |
- The surprise, it's that injection of randomness 00:28:42.940 |
continued inspiration, like the wit, the humor. 00:28:55.020 |
but I think if you have enough humans in the room. 00:29:03.100 |
I thought you meant to impress you with its intelligence, 00:29:13.300 |
and it's gonna get it right, and you're gonna say, wow, 00:29:15.980 |
Our systems of January 2020 have not been doing that. 00:29:22.300 |
like the reason people click like on stuff on the internet, 00:29:40.500 |
what is the most beautiful or surprising idea 00:29:43.180 |
in deep learning or AI in general you've come across? 00:29:46.860 |
- So I think the most beautiful thing about deep learning 00:29:57.700 |
And then you got some theories as to, you know, 00:30:05.940 |
then it will do the same function that the brain does. 00:30:14.180 |
and you make them larger and they keep getting better. 00:30:17.900 |
I find it unbelievable that this whole AI stuff 00:30:24.980 |
are there little bits and pieces of intuitions, 00:30:54.820 |
sort of empirical evidence kind of convinces you, 00:31:00.380 |
It shows you that, look, this evolutionary process 00:31:08.280 |
But it doesn't really get you to the insights 00:31:23.980 |
You know, you got around the experiment, it's important. 00:31:35.020 |
You say, yeah, let's make a big neural network. 00:31:37.420 |
And it's going to work much better than anything before it. 00:31:43.620 |
That's amazing when a theory is validated like this. 00:32:10.460 |
just to find the set of what biology represents. 00:32:21.020 |
it's really hard to have good predictive theory. 00:32:25.380 |
In physics, people make these super precise theories 00:32:29.340 |
And in machine learning, we're kind of in between. 00:32:33.820 |
if machine learning somehow helped us discover 00:33:04.060 |
from the past 10 years, I would say most of it, 00:33:08.900 |
where things that felt like really new ideas showed up. 00:33:12.900 |
But by and large, it was every year we thought, 00:33:19.020 |
And then the next year, okay, now this is big deep learning. 00:33:31.420 |
- Do you think it's getting harder and harder 00:33:46.140 |
it can be harder because there is a very large number 00:33:51.820 |
then you can make a lot of very interesting discoveries, 00:33:57.460 |
of managing a huge compute cluster to run your experiments. 00:34:06.460 |
but you're one of the smartest people I know, 00:34:09.500 |
So let's imagine all the breakthroughs that happen 00:34:15.260 |
Do you think most of those breakthroughs can be done 00:34:23.780 |
do you think compute and large efforts will be necessary? 00:34:33.900 |
When you say one computer, you mean how large? 00:34:51.020 |
The stack of deep learning is starting to be quite deep. 00:34:53.780 |
If you look at it, you've got all the way from the ideas, 00:35:04.180 |
the building the actual cluster, the GPU programming, 00:35:10.580 |
and I think it can be quite hard for a single person 00:35:17.900 |
- What about what like Vladimir Vapnik really insists on 00:35:22.100 |
is taking MNIST and trying to learn from very few examples. 00:35:29.060 |
Do you think there'll be breakthroughs in that space 00:35:34.860 |
- I think there will be a large number of breakthroughs 00:35:37.900 |
in general that will not need a huge amount of compute. 00:35:42.100 |
I think that some breakthroughs will require a lot of compute 00:35:45.380 |
and I think building systems which actually do things 00:35:51.340 |
If you want to do X and X requires a huge neural net, 00:35:59.340 |
I think there is lots of room for very important work 00:36:05.140 |
- Can you maybe sort of on the topic of the science 00:36:08.420 |
of deep learning, talk about one of the recent papers 00:36:23.580 |
So what happened is that some, over the years, 00:36:27.020 |
some small number of researchers noticed that 00:36:32.180 |
and it seems to go in contradiction with statistical ideas. 00:36:34.660 |
And then some people made an analysis showing 00:36:36.940 |
that actually you got this double descent bump. 00:36:38.940 |
And what we've done was to show that double descent occurs 00:36:42.780 |
for pretty much all practical deep learning systems. 00:36:46.420 |
And that it'll be also, so can you step back? 00:36:49.940 |
What's the X axis and the Y axis of a double descent plot? 00:37:10.020 |
So if you increase the size of the neural network slowly, 00:37:19.020 |
Then when the neural network is really small, 00:37:23.580 |
you get a very rapid increase in performance. 00:37:27.300 |
And at some point performance will get worse. 00:37:38.660 |
And then as you make it larger, it starts to get better again. 00:37:50.020 |
but it also occurs in the case of linear classifiers. 00:37:53.140 |
And the intuition basically boils down to the following. 00:37:55.940 |
When you have a large dataset and a small model, 00:38:02.020 |
then small, tiny random, so basically what is overfitting? 00:38:07.100 |
Overfitting is when your model is somehow very sensitive 00:38:11.980 |
to the small, random, unimportant stuff in your dataset. 00:38:18.980 |
So if you have a small model and you have a big dataset, 00:38:24.780 |
some training cases are randomly in the dataset 00:38:31.660 |
to this randomness because there is pretty much 00:38:43.340 |
that neural networks don't overfit every time, 00:38:48.340 |
very quickly, before ever being able to learn anything. 00:38:57.660 |
so maybe, so let me try to give the explanation 00:39:15.540 |
where your neural network achieves zero error. 00:39:18.060 |
And SGD is going to find approximately the point-- 00:39:22.620 |
- Approximately the point with the smallest norm 00:39:27.540 |
- And that can also be proven to be insensitive 00:39:48.860 |
So this is the best explanation, more or less. 00:39:54.020 |
to have more parameters, so to be bigger than the data. 00:39:58.660 |
- That's right, but only if you don't early stop. 00:40:00.860 |
If you introduce early stop in your regularization, 00:40:30.780 |
- Do you have any intuition why this happens? 00:40:41.260 |
that when the dataset has as many degrees of freedom 00:40:45.660 |
as the model, then there is a one-to-one correspondence 00:40:49.100 |
between them and so small changes to the dataset 00:40:55.100 |
So your model is very sensitive to all the randomness. 00:41:16.500 |
- Exactly, the spurious correlation which you don't want. 00:41:20.580 |
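A small, self-contained illustration of the double descent shape discussed above, using minimum-norm least squares on random ReLU features. This is a standard toy setting, not the paper's actual experiments; the test error typically peaks when the number of features is near the number of training points and then falls again as the model grows past that point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 10
w_true = rng.normal(size=d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)   # noisy training labels
y_te = X_te @ w_true

V = rng.normal(size=(d, 400))                      # fixed random projection
feats = lambda X, p: np.maximum(X @ V[:, :p], 0)   # first p random ReLU features

for p in [5, 10, 20, 40, 80, 200, 400]:            # "model size" sweep
    F_tr, F_te = feats(X_tr, p), feats(X_te, p)
    beta = np.linalg.pinv(F_tr) @ y_tr              # minimum-norm fit (what SGD finds)
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"features={p:4d}  test MSE={err:8.3f}")  # bump typically near p ~ n_train
```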
- Jeff Hinton suggested we need to throw back propagation. 00:41:23.540 |
We already kind of talked about this a little bit, 00:41:29.820 |
I mean, of course, some of that is a little bit 00:41:39.640 |
- Well, the thing that he said precisely is that 00:41:42.180 |
to the extent that you can't find back propagation 00:41:44.100 |
in the brain, it's worth seeing if we can learn something 00:41:47.680 |
from how the brain learns, but back propagation 00:41:58.140 |
we should also try to implement that in neural networks? 00:42:00.660 |
- If it turns out that we can't find back propagation 00:42:03.780 |
- If we can't find back propagation in the brain. 00:42:16.020 |
- I mean, I personally am a big fan of back propagation. 00:42:18.460 |
I think it's a great algorithm because it solves 00:42:23.100 |
finding a neural circuit subject to some constraints. 00:42:30.440 |
so that's why I really, I think it's pretty unlikely 00:42:35.000 |
that we'll have anything which is going to be 00:42:38.680 |
It could happen, but I wouldn't bet on it right now. 00:42:41.420 |
- So let me ask a sort of big picture question. 00:42:46.840 |
Do you think neural networks can be made to reason? 00:42:53.380 |
- Well, if you look, for example, at AlphaGo or AlphaZero, 00:43:01.740 |
which we all agree is a game that requires reasoning, 00:43:34.040 |
But yes, I think it has some of the same elements 00:43:43.180 |
There's a sequential element of step-wise consideration 00:43:49.180 |
of possibilities, and sort of building on top 00:43:53.320 |
of those possibilities in a sequential manner 00:43:57.640 |
So yeah, I guess playing Go is kind of like that. 00:44:00.520 |
And when you have a single neural network doing that 00:44:04.920 |
So there's an existence proof in a particular 00:44:06.760 |
constrained environment that a process akin to 00:44:33.400 |
will look similar to the neural network architectures 00:44:50.240 |
will be very similar to the architectures that exist today. 00:44:57.100 |
But these neural nets are so insanely powerful. 00:45:02.100 |
Why wouldn't they be able to learn to reason? 00:45:05.560 |
Humans can reason, so why can't neural networks? 00:45:11.640 |
neural networks do is a kind of just weak reasoning? 00:45:14.660 |
So it's not a fundamentally different process? 00:45:16.600 |
Again, this is stuff nobody knows the answer to. 00:45:30.560 |
which doesn't require reasoning, it's not going to reason. 00:45:34.020 |
This is a well-known effect where the neural network 00:45:36.360 |
will solve exactly the, it will solve the problem 00:45:39.320 |
that you pose in front of it in the easiest way possible. 00:46:09.200 |
- Yeah, so the thing which I said precisely was that 00:46:29.240 |
Now, you can also prove mathematically that it is, 00:46:33.920 |
which generates some data is not a computable operation. 00:46:52.880 |
the shortest program which generates our data, 00:47:01.620 |
even a large circuit which fits our data in some way. 00:47:05.320 |
- Well, I think what you meant by the small circuit 00:47:12.360 |
back then I really haven't fully internalized 00:47:17.080 |
The things we know about over-parameterized neural nets, 00:47:23.200 |
whose weights contain a small amount of information, 00:47:29.200 |
If you imagine the training process of a neural network 00:47:37.080 |
then somehow the amount of information in the weights 00:47:42.960 |
which would explain why they generalize so well. 00:47:45.240 |
- So that's, the large circuit might be one that's helpful 00:47:53.360 |
- But do you see it important to be able to try 00:48:04.880 |
I think it's kind of, the answer is kind of yes, 00:48:11.200 |
It's the reason we are pushing on deep learning, 00:48:23.920 |
We've got our pillar, which is the training pillar. 00:48:27.560 |
And now we are trying to contort our neural networks 00:48:36.440 |
And so being trainable means starting from scratch, 00:48:40.600 |
knowing nothing, you can actually pretty quickly 00:48:42.880 |
converge towards knowing a lot or even slowly. 00:48:45.920 |
But it means that given the resources at your disposal, 00:48:55.440 |
- Yeah, that's a pillar we can't move away from. 00:48:58.520 |
and whereas if you say, hey, let's find the shortest program, 00:49:02.840 |
So it doesn't matter how useful that would be. 00:49:09.920 |
that neural networks are good at finding small circuits 00:49:13.420 |
Do you think then the matter of finding small programs 00:49:34.600 |
of people successfully finding programs really well. 00:49:40.680 |
is you'd train a deep neural network to do it basically. 00:49:48.160 |
- But there's not good illustrations of that. 00:49:59.880 |
And put another way, you don't see why it's not possible. 00:50:04.200 |
- Well, it's kind of like more, it's more a statement of, 00:50:07.920 |
I think that it's unwise to bet against deep learning. 00:50:18.720 |
then it doesn't take too long for some deep neural net 00:50:27.840 |
I've stopped betting against neural networks at this point 00:50:42.200 |
So being able to aggregate important information 00:50:45.520 |
over long periods of time that would then serve 00:51:01.600 |
- So in some sense, the parameters already do that. 00:51:04.840 |
The parameters are an aggregation of the day, 00:51:07.920 |
of the neural, of the entirety of the neural experience. 00:51:10.920 |
And so they count as the long, as long-term knowledge. 00:51:21.520 |
people have investigated language models as knowledge basis. 00:51:27.320 |
- Yeah, but in some sense, do you think in every sense, 00:51:29.880 |
do you think there's a, it's all just a matter 00:51:40.240 |
'Cause right now, I mean, there's not been mechanisms 00:51:43.080 |
that do remember really long-term information. 00:51:51.760 |
So I'm thinking of the kind of compression of information 00:51:58.160 |
the knowledge bases represent, sort of creating a, 00:52:02.960 |
now, I apologize for my sort of human-centric thinking 00:52:06.920 |
about what knowledge is, 'cause neural networks 00:52:12.920 |
with the kind of knowledge they have discovered. 00:52:15.800 |
But a good example for me is knowledge bases, 00:52:18.740 |
being able to build up over time something like 00:52:30.840 |
Obviously not the actual Wikipedia or the language, 00:52:37.920 |
So it's a really nice compressed knowledge base, 00:52:40.360 |
or something akin to that in a non-interpretable sense 00:52:46.980 |
- Well, the neural networks would be non-interpretable 00:52:49.440 |
but their outputs should be very interpretable. 00:52:52.200 |
- Okay, so yeah, how do you make very smart neural networks 00:53:00.280 |
and the text will generally be interpretable. 00:53:02.120 |
- Do you find that the epitome of interpretability, 00:53:12.240 |
I would like the neural network to come up with examples 00:53:17.960 |
and examples where it's completely brilliant. 00:53:22.280 |
is to generate a lot of examples and use my human judgment. 00:53:52.280 |
there are actually two answers to that question. 00:53:54.360 |
One answer is, you know, we have the neural net, 00:53:58.520 |
and we can try to understand what the different neurons 00:54:11.400 |
where you say, you know, you look at a human being, 00:54:16.520 |
how do you know what a human being is thinking? 00:54:18.800 |
You ask them, you say, hey, what do you think about this? 00:54:25.640 |
in the sense you already have a mental model. 00:54:41.560 |
and then everything you ask, you're adding onto that. 00:54:49.800 |
that's one of the really interesting qualities 00:54:51.720 |
of the human being, is that information is sticky. 00:54:55.040 |
You don't, you seem to remember the useful stuff, 00:54:57.560 |
aggregate it well, and forget most of the information 00:55:06.800 |
It's just that neural networks are much crappier 00:55:10.680 |
It doesn't seem to be fundamentally that different. 00:55:13.280 |
But just to stick on reasoning for a little longer, 00:55:39.320 |
Solving open-ended problems with out-of-the-box solutions. 00:55:43.160 |
- And sort of theorem type mathematical problems. 00:55:49.520 |
- Yeah, I think those ones are a very natural example 00:55:56.620 |
And so by the way, and this comes back to the point 00:56:03.240 |
machine learning, deep learning as a field is very fortunate 00:56:06.120 |
because we have the ability to sometimes produce 00:56:14.320 |
We have the ability to produce conversation changing results. 00:56:19.540 |
- Conversation, and then of course, just like you said, 00:56:28.420 |
Yeah, that whole mortality thing is kind of a sticky problem 00:56:47.180 |
Can you briefly kind of try to describe the recent history 00:56:51.140 |
of using neural networks in the domain of language and text? 00:57:00.260 |
tiny recurrent neural network applied to language 00:57:03.900 |
So the history is really, you know, fairly long at least. 00:57:17.220 |
of all deep learning, and that's data and compute. 00:57:19.700 |
So suddenly you move from small language models, 00:57:31.660 |
because they're trying to predict the next word. 00:57:44.860 |
and there is a space between those characters. 00:57:47.980 |
And you'll notice that sometimes there is a comma 00:57:50.020 |
and then the next character is a capital letter. 00:58:14.060 |
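The simplest possible version of "predict the next character and pick up local regularities" is just a bigram counter; a toy sketch:

```python
from collections import Counter, defaultdict

# Count how often each character follows each other character,
# then predict the most likely continuation.
text = "hello world. hello there. hello again."
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def predict_next(ch):
    return counts[ch].most_common(1)[0][0] if counts[ch] else " "

print(predict_next("h"))   # 'e' -- it has picked up a local regularity
print(predict_next("."))   # ' ' -- after a period comes a space
```

The progression being described is that same next-token prediction objective, scaled from counts like these up to LSTMs and then transformers with vastly more data and compute.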
'cause that's where you and Noam Chomsky disagree. 00:58:16.620 |
So you think we're actually taking incremental steps, 00:58:58.820 |
know precisely what Chomsky means when he talks about him. 00:59:12.740 |
when you inspect those larger language models, 00:59:14.700 |
they exhibit signs of understanding the semantics, 00:59:28.620 |
And we noticed that when you increase the size of the LSTM 00:59:35.420 |
then one of the neurons starts to represent the sentiment 00:59:49.460 |
but sentiment is whether it's a positive or negative review. 00:59:58.780 |
that a small neural net does not capture sentiment 01:00:11.060 |
- And with size, you quickly run out of syntax to model, 01:00:15.820 |
and then you really start to focus on the semantics, 01:00:28.260 |
of semantic understanding, partial semantic understanding, 01:00:30.780 |
but the smaller models do not show those signs. 01:00:34.540 |
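A hedged sketch of the kind of probe behind the "sentiment neuron" observation: given hidden states from a next-character model run over reviews (simulated here with random data plus one planted unit), look for the single unit whose activation best tracks whether a review is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews, n_units = 200, 64
H = rng.normal(size=(n_reviews, n_units))     # stand-in for per-review hidden states
y = rng.integers(0, 2, size=n_reviews)        # 1 = positive review (made-up labels)
H[:, 17] += 2.0 * (y - 0.5)                   # plant a "sentiment unit" for the demo

corr = [abs(np.corrcoef(H[:, j], y)[0, 1]) for j in range(n_units)]
best = int(np.argmax(corr))
print(f"most sentiment-correlated unit: {best}, |corr| = {corr[best]:.2f}")
```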
- Can you take a step back and say, what is GPT-2, 01:00:50.380 |
that was trained on about 40 billion tokens of text, 01:01:03.940 |
- The transformer, it's the most important advance 01:01:06.740 |
in neural network architectures in recent history. 01:01:13.300 |
not necessarily sort of technically speaking, 01:01:17.500 |
versus maybe what recurring neural networks represent. 01:01:23.380 |
is a combination of multiple ideas simultaneously 01:01:59.400 |
The second thing is that transformer is not recurrent. 01:02:15.360 |
so therefore less deep and easier to optimize. 01:02:17.840 |
And the combination of those factors make it successful. 01:02:31.080 |
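The attention piece of that combination, in its simplest single-head form; real transformers add learned query/key/value projections, multiple heads, and feed-forward layers on top of this core operation.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over a sequence.

    Every position attends to every other position in parallel, so there
    is no recurrence over time; the whole sequence is processed at once.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ X                               # mix values by attention weight

X = np.random.default_rng(0).normal(size=(6, 8))     # 6 tokens, 8-dim embeddings
print(self_attention(X).shape)                       # (6, 8)
```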
- Were you surprised how well transformers worked 01:02:42.880 |
So you got to see the whole set of revolutions 01:03:02.480 |
It was just amazing to see generate this text of this. 01:03:07.360 |
that at that time, you've seen all this progress in GANs, 01:03:10.480 |
in improving the samples produced by GANs were just amazing. 01:03:31.840 |
But then to see it with your own eyes, it's something else. 01:03:37.240 |
And now there's sort of some cognitive scientists 01:03:51.880 |
the fact that they're able to model the language so well is. 01:04:03.720 |
- Do you think that bar will continuously be moved? 01:04:08.840 |
really dramatic economic impact, that's when, 01:04:11.960 |
I think that's in some sense the next barrier. 01:04:13.800 |
Because right now, if you think about the work in AI, 01:04:18.880 |
It's really hard to know what to make of all these advances. 01:04:30.400 |
At some point, I think people who are outside of AI, 01:04:35.400 |
they can no longer distinguish this progress anymore. 01:04:41.760 |
and how there's a lot of brilliant work in Russian 01:04:44.120 |
that the rest of the world doesn't know about. 01:04:53.880 |
where we're going to see sort of economic big impact? 01:05:01.600 |
I want to point out that translation already today is huge. 01:05:16.400 |
I think self-driving is going to be hugely impactful. 01:05:20.320 |
And that's, you know, it's unknown exactly when it happens, 01:05:24.480 |
but again, I would not bet against deep learning. 01:05:28.000 |
- So that's deep learning in general, but you think-- 01:05:33.160 |
But I was talking about sort of language models. 01:05:36.200 |
- Just to check. - I veered off a little bit. 01:05:54.200 |
that can take on both language and vision tasks. 01:06:01.440 |
Now let's see, what can I ask about GPT-2 more? 01:06:06.980 |
It's, you take a transform, you make it bigger, 01:06:10.800 |
and suddenly it does all those amazing things. 01:06:12.700 |
- Yeah, one of the beautiful things is that GPT, 01:06:28.240 |
- Sort of like what are the next steps with GPT-2, 01:06:31.480 |
- I mean, I think for sure seeing what larger versions 01:06:41.240 |
There's one question which I'm curious about, 01:06:45.400 |
so we feed it all this data from the internet, 01:06:48.160 |
all those random facts about everything in the internet. 01:07:04.420 |
people don't learn all data indiscriminately. 01:07:21.200 |
can you just elaborate that a little bit more? 01:07:29.920 |
that the optimization of how you select data, 01:07:35.920 |
is going to be a place for a lot of breakthroughs, 01:07:42.200 |
because there hasn't been many breakthroughs there 01:07:45.160 |
I feel like there might be private breakthroughs 01:07:49.400 |
'cause it's a fundamental problem that has to be solved 01:07:55.360 |
What do you think about the space in general? 01:07:57.880 |
- Yeah, so I think that for something like active learning, 01:08:00.280 |
or in fact, for any kind of capability, like active learning, 01:08:08.020 |
It's very hard to do research about the capability 01:08:14.280 |
is that you will come up with an artificial task, 01:08:16.760 |
get good results, but not really convince anyone. 01:08:27.520 |
some clever formulation of MNIST will convince people. 01:08:33.680 |
with a simple active learning scheme on MNIST 01:09:14.040 |
you can start to imagine that it would be used by bots 01:09:21.680 |
So there's this nervousness about what it's possible to do. 01:09:32.240 |
powerful artificial intelligence models to the public? 01:09:42.160 |
about how we manage the use of the systems and so on? 01:09:51.760 |
that you've gathered from just thinking about this, 01:10:00.680 |
is that the field of AI has been in a state of childhood, 01:10:08.720 |
What that means is that AI is very successful 01:10:14.100 |
and its impact is not only large, but it's also growing. 01:10:16.940 |
And so for that reason, it seems wise to start thinking 01:10:22.820 |
about the impact of our systems before releasing them, 01:10:29.660 |
And with the case of GPT-2, like I mentioned earlier, 01:10:38.700 |
It seemed plausible that something like GPT-2 01:10:41.540 |
could easily be used to reduce the cost of disinformation. 01:10:56.460 |
Many people use these models in lots of cool ways. 01:10:59.720 |
There've been lots of really cool applications. 01:11:02.020 |
There haven't been any negative applications we know of, 01:11:07.620 |
But also other people replicated similar models. 01:11:09.980 |
- That's an interesting question, though, that we know of. 01:11:16.060 |
is at least part of the answer to the question of how do we... 01:11:20.780 |
What do we do once we create a system like this? 01:11:29.980 |
Like, say you don't wanna release the model at all 01:11:32.400 |
because it's useful to you for whatever the business is. 01:11:35.860 |
- Well, plenty of people don't release models already. 01:11:39.940 |
but is there some moral, ethical responsibility 01:11:44.560 |
when you have a very powerful model to sort of communicate? 01:11:51.380 |
it was unclear how much it could be used for misinformation. 01:12:03.860 |
Please tell me there's some optimistic pathway 01:12:12.660 |
Or is it still really difficult from one company 01:12:21.300 |
It's definitely possible to discuss these kinds of models 01:12:38.060 |
- I think that's a place where it's important 01:12:47.960 |
which is going to be increasingly more powerful. 01:12:59.460 |
I tend to believe in the better angels of our nature, 01:13:06.860 |
That when you build a really powerful AI system 01:13:27.340 |
that would push people to close that development 01:13:51.300 |
but in general, what does it take, do you think? 01:13:56.220 |
but I think the deep learning plus maybe another small idea. 01:14:05.620 |
So like you've spoken about the powerful mechanism 01:14:11.540 |
sort of exploring the world in a competitive setting 01:14:16.660 |
against other entities that are similarly skilled as them 01:14:30.340 |
I think is going to be deep learning plus some ideas. 01:14:35.060 |
And I think self-play will be one of those ideas. 01:14:40.660 |
Self-play has this amazing property that it can surprise us 01:14:49.620 |
For example, pretty much every self-play system, 01:14:54.620 |
both our Dota bot, I don't know if OpenAI had a release 01:15:00.460 |
about multi-agent where you had two little agents 01:15:11.060 |
They all produce behaviors that we didn't expect. 01:15:18.740 |
that our systems don't exhibit routinely right now. 01:15:21.380 |
And so that's why I like this area, I like this direction 01:15:28.460 |
And an AGI system would surprise us fundamentally. 01:15:31.260 |
- Yes, and to be precise, not just a random surprise, 01:15:34.580 |
but to find a surprising solution to a problem 01:15:40.060 |
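A minimal toy of the self-play idea: an agent repeatedly improves against a frozen copy of its own current strategy, here on rock-paper-scissors with a simple multiplicative-weights update. This is purely illustrative and nothing like the Dota system.

```python
import numpy as np

payoff = np.array([[0, -1, 1],       # rock-paper-scissors payoff for player 1
                   [1,  0, -1],
                   [-1, 1,  0]])

policy = np.array([0.6, 0.3, 0.1])   # current mixed strategy
for it in range(200):
    opponent = policy.copy()         # frozen copy of ourselves
    values = payoff @ opponent       # expected payoff of each action vs. that copy
    policy = policy * np.exp(0.5 * values)   # nudge toward better actions
    policy /= policy.sum()

print(policy.round(3))               # the strategy keeps adapting to a copy of itself
```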
Now, a lot of the self-play mechanisms have been used 01:15:43.580 |
in the game context or at least in a simulation context. 01:15:56.700 |
How much faith, promise do you have in simulation 01:16:04.500 |
in the real world, whether it's the real world 01:16:17.540 |
It has certain strengths and certain weaknesses 01:16:24.460 |
That's true, but one of the criticisms of self-play, 01:16:32.740 |
one of the criticisms of reinforcement learning 01:16:50.780 |
and be able to learn in non-simulated environments? 01:16:53.420 |
Or do you think it's possible to also just simulate 01:16:57.020 |
in a photorealistic and physics realistic way, 01:17:01.100 |
the real world in a way that we can solve real problems 01:17:18.660 |
Also, OpenAI in the summer has demonstrated a robot hand 01:17:32.660 |
- I wasn't aware that was trained in simulation. 01:17:40.980 |
- No, 100% of the training was done in simulation 01:17:44.820 |
and the policy that was learned in simulation 01:17:50.940 |
it could very quickly adapt to the physical world. 01:17:53.940 |
- So the kind of perturbations with the giraffe 01:18:08.140 |
but not the kind of perturbations we've had in the video. 01:18:12.660 |
it's never been trained with a stuffed giraffe. 01:18:17.060 |
- So in theory, these are novel perturbations. 01:18:25.100 |
That's a clean, small scale, but clean example 01:18:38.180 |
And the better the transfer capabilities are, 01:18:50.220 |
which you could then carry with you to the real world. 01:18:53.420 |
As humans do all the time when they play computer games. 01:19:03.460 |
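The robot-hand result relied on heavily randomizing the simulator during training. A hedged sketch of that domain-randomization idea, where the environment constructor and training step named below are hypothetical placeholders, not real APIs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    # every episode draws a different "world", so the learned policy
    # has to work across a whole family of simulated physics
    return {
        "friction":     rng.uniform(0.5, 1.5),
        "object_mass":  rng.uniform(0.05, 0.5),   # kg
        "motor_delay":  rng.uniform(0.0, 0.04),   # seconds
        "camera_noise": rng.uniform(0.0, 0.1),
    }

for episode in range(3):
    params = sample_sim_params()
    # env = make_hand_env(**params)      # hypothetical simulator constructor
    # run_policy_and_update(env)         # hypothetical training step
    print(episode, params)
```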
Do you think AGI says that we need to have a body? 01:19:12.900 |
sort of fear of mortality, sort of self-preservation 01:19:16.580 |
in the physical space, which comes with having a body. 01:19:24.260 |
But I think it's very useful to have a body for sure, 01:19:28.820 |
you can learn things which cannot be learned without a body. 01:19:34.420 |
if you don't have a body, you could compensate for it 01:19:44.260 |
and they were able to compensate for the lack of modalities. 01:19:50.380 |
- So even if you're not able to physically interact 01:20:02.620 |
I'm not sure if it's connected to having a body or not, 01:20:07.820 |
And a more constrained version of that is self-awareness. 01:20:11.220 |
Do you think an AGI system should have consciousness? 01:20:17.300 |
whatever the heck you think consciousness is. 01:20:37.740 |
from the representation that's stored within your networks? 01:20:45.080 |
you're able to represent more and more of the world? 01:20:47.000 |
- Well, I'd say, I'd make the following argument, 01:20:53.740 |
and if you believe that artificial neural nets 01:20:59.500 |
then there should at least exist artificial neural nets 01:21:04.220 |
- You're leaning on that existence proof pretty heavily. 01:21:17.060 |
if there's not some magic in the brain that we're not, 01:21:20.760 |
I mean, I don't mean a non-materialistic magic, 01:21:23.580 |
but that the brain might be a lot more complicated 01:21:29.860 |
- If that's the case, then it should show up, 01:21:40.200 |
but let me talk about another poorly defined concept 01:21:48.140 |
what do you think is a good test of intelligence for you? 01:21:51.700 |
Are you impressed by the test that Alan Turing formulated 01:21:55.720 |
with the imitation game with natural language? 01:22:08.020 |
There's a certain frontier of capabilities today, 01:22:12.140 |
and there exist things outside of that frontier, 01:22:18.980 |
For example, I would be impressed by a deep learning system 01:22:27.300 |
like machine translation or computer vision task 01:22:33.460 |
a human wouldn't make under any circumstances. 01:22:44.940 |
they might be more accurate than human beings, 01:22:46.660 |
but they still, they make a different set of mistakes. 01:22:49.180 |
- So I would guess that a lot of the skepticism 01:22:55.820 |
is when they look at their mistakes and they say, 01:23:03.180 |
And I think that changing that would inspire me, 01:23:15.460 |
But I also just don't like that human instinct 01:23:23.180 |
when we criticize any group of creatures as the other. 01:23:33.460 |
is much smarter than human beings at many things. 01:23:44.960 |
- It's kind of hard to judge what depth means, 01:23:49.940 |
in which humans don't make mistakes that these models do. 01:23:54.500 |
- Yes, the same is applied to autonomous vehicles. 01:23:57.780 |
The same is probably gonna continue being applied 01:24:09.460 |
is the search for one case where the system fails 01:24:17.020 |
and then many people writing articles about it, 01:24:20.660 |
and then broadly as the public generally gets convinced 01:24:26.580 |
And we like pacify ourselves by thinking it's not intelligent 01:24:38.140 |
are also extremely impressed by the system that exists today. 01:24:40.860 |
But I think this connects to the earlier point we discussed 01:24:43.140 |
that it's just confusing to judge progress in AI. 01:24:47.920 |
- And you have a new robot demonstrating something. 01:24:52.760 |
And I think that people will start to be impressed 01:24:56.020 |
once AI starts to really move the needle on the GDP. 01:24:59.380 |
- So you're one of the people that might be able 01:25:02.080 |
to create an AGI system here, not you, but you and OpenAI. 01:25:09.080 |
and you get to spend sort of the evening with it, him, her, 01:25:20.040 |
- Well, the first time I would just ask all kinds 01:25:23.240 |
of questions and try to get it to make a mistake. 01:25:25.800 |
And I would be amazed that it doesn't make mistakes. 01:25:35.000 |
would they be factual or would they be personal, 01:25:57.860 |
that might be in the room where this happens. 01:26:00.480 |
So let me ask sort of a profound question about, 01:26:08.440 |
I've been talking to a lot of people who are studying power. 01:26:13.220 |
Abraham Lincoln said, "Nearly all men can stand adversity, 01:26:17.760 |
"but if you want to test a man's character, give him power." 01:26:33.440 |
direct possession and control of the AGI system. 01:26:36.300 |
So what do you think, after spending that evening 01:26:49.180 |
is one where humanity are like the board members 01:27:16.100 |
for what the AGI that represents them should do, 01:27:19.100 |
and then AGI that represents them goes and does it. 01:27:21.420 |
I think a picture like that, I find very appealing. 01:27:27.220 |
you would have an AGI for a city, for a country, 01:27:34.020 |
take the democratic process to the next level. 01:27:49.100 |
as long as it's possible to press the reset button. 01:27:56.400 |
- So I think that it definitely will be possible to build. 01:28:06.300 |
humans have control over the AI systems that they build. 01:28:24.060 |
so it's not that just they can't help but be controlled, 01:28:27.060 |
they exist, one of the objectives of their existence 01:28:48.740 |
and to feed them and to dress them and to take care of them. 01:29:02.300 |
a similar deep drive, that it will be delighted to fulfill 01:29:07.060 |
and the drive will be to help humans flourish. 01:29:17.500 |
And between that moment and the Democratic board members 01:29:31.860 |
So as George Washington, despite all the bad things he did, 01:29:36.500 |
one of the big things he did is he relinquished power. 01:29:39.380 |
He, first of all, didn't want to be president. 01:29:48.080 |
Do you see yourself being able to relinquish control 01:29:59.300 |
At first financial, just make a lot of money, right? 01:30:02.780 |
And then control by having possession of this AGI system. 01:30:09.060 |
I'd find it trivial to relinquish this kind of power. 01:30:11.500 |
I mean, the kind of scenario you are describing 01:30:19.000 |
I would absolutely not want to be in that position. 01:30:25.680 |
or the minority of people in the AI community? 01:30:30.740 |
- It's an open question and an important one. 01:30:33.740 |
Are most people good is another way to ask it. 01:30:49.300 |
Are there specific mechanism you can think of 01:30:56.720 |
of continued alignment as we develop the AI systems? 01:31:01.420 |
In some sense, the kind of question which you are asking is, 01:31:07.380 |
so if I were to translate the question to today's terms, 01:31:10.700 |
it would be a question about how to get an RL agent 01:31:15.700 |
that's optimizing a value function which itself is learned. 01:31:21.220 |
And if you look at humans, humans are like that 01:31:23.220 |
because the reward function, the value function of humans 01:31:39.140 |
and as objective as possible perception system 01:31:41.580 |
that will be trained separately to recognize, 01:31:47.580 |
to internalize human judgments on different situations. 01:31:52.020 |
And then that component would then be integrated 01:32:05.740 |
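A hedged sketch of that idea: train a separate model on human judgments of situations, then expose it as the reward an RL agent optimizes. All data, sizes, and the "approval" rule below are made up for illustration.

```python
import torch

# A small "judgment" model trained on human good/bad labels of situations.
judge = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(judge.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

situations = torch.randn(256, 16)                        # stand-in situation features
labels = (situations[:, 0] > 0).float().unsqueeze(1)     # pretend-human approvals

for step in range(200):                                  # supervised phase
    loss = bce(judge(situations), labels)
    opt.zero_grad(); loss.backward(); opt.step()

def reward(situation):
    # the learned judgment becomes the reward signal the agent sees
    return torch.sigmoid(judge(situation)).item()

print(reward(torch.randn(16)))
```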
- So on that topic of the objective functions 01:32:45.700 |
and try to maximize our own value and enjoyment 01:33:03.980 |
And that's an interesting fact of an RL environment. 01:33:08.100 |
- Well, I was making a slightly different point 01:33:13.340 |
and their wants create the drives that cause them to, 01:33:29.020 |
There's gotta be some underlying sort of Freud, 01:33:34.000 |
there's people who think it's the fear of death 01:33:47.140 |
there might be some kind of fundamental objective function 01:33:54.140 |
but it seems like it's very difficult to make it explicit. 01:33:56.860 |
- I think that probably is an evolutionary objective 01:34:04.300 |
but it doesn't give an answer to the question 01:34:08.220 |
I think you can see how humans are part of this big process, 01:34:13.300 |
this ancient process, we exist on a small planet 01:34:19.940 |
So given that we exist, try to make the most of it 01:34:24.260 |
and try to enjoy more and suffer less as much as we can. 01:34:34.820 |
moments that if you went back, you would do differently? 01:34:39.020 |
And two, are there moments that you're especially proud of 01:34:52.460 |
that with the benefit of hindsight, I wouldn't have made them 01:35:04.700 |
I'm very fortunate to have done things I'm proud of 01:35:10.940 |
but I don't think that that is the source of happiness. 01:35:13.700 |
- So your academic accomplishments, all the papers, 01:35:17.420 |
you're one of the most cited people in the world, 01:35:23.880 |
what is the source of happiness and pride for you? 01:35:29.620 |
- I mean, all those things are a source of pride for sure. 01:35:31.460 |
I'm very grateful for having done all those things 01:35:39.300 |
happiness, well, my current view is that happiness 01:35:51.380 |
or you can talk to someone and be happy as a result as well. 01:35:54.900 |
Or conversely, you can have a meal and be disappointed 01:36:00.460 |
So I think a lot of happiness comes from that, 01:36:02.380 |
but I'm not sure, I don't wanna be too confident. 01:36:05.580 |
- Being humble in the face of the uncertainty 01:36:07.860 |
seems to be also a part of this whole happiness thing. 01:36:12.180 |
Well, I don't think there's a better way to end it 01:36:14.100 |
than meaning of life and discussions of happiness. 01:36:22.620 |
You've given the world many incredible ideas. 01:36:24.900 |
I really appreciate it and thanks for talking today. 01:36:27.500 |
- Yeah, thanks for stopping by, I really enjoyed it. 01:36:33.340 |
and thank you to our presenting sponsor, Cash App. 01:36:38.140 |
by downloading Cash App and using the code LEXPODCAST. 01:36:42.620 |
If you enjoy this podcast, subscribe on YouTube, 01:36:47.980 |
support on Patreon, or simply connect with me on Twitter 01:37:15.220 |
Thank you for listening and hope to see you next time.