Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
Chapters
0:00 Introduction
2:23 AlexNet paper and the ImageNet moment
8:33 Cost functions
13:39 Recurrent neural networks
16:19 Key ideas that led to success of deep learning
19:57 What's harder to solve: language or vision?
29:35 We're massively underestimating deep learning
36:04 Deep double descent
41:20 Backpropagation
42:42 Can neural networks be made to reason?
50:35 Long-term memory
56:37 Language models
60:35 GPT-2
67:14 Active learning
68:52 Staged release of AI systems
73:41 How to build AGI?
85:00 Question to AGI
92:07 Meaning of life
00:00:00.000 |
The following is a conversation with Ilya Sutskever, 00:00:06.120 |
one of the most cited computer scientists in history 00:00:13.480 |
And to me, one of the most brilliant and insightful minds 00:00:21.680 |
who I would rather talk to and brainstorm with 00:00:24.040 |
about deep learning, intelligence, and life in general 00:00:37.200 |
For everyone feeling the medical, psychological, 00:00:43.160 |
Stay strong, we're in this together, we'll beat this thing. 00:00:54.040 |
support it on Patreon, or simply connect with me on Twitter 00:01:18.860 |
Cash App lets you send money to friends, buy Bitcoin, 00:01:22.060 |
invest in the stock market with as little as $1. 00:01:29.280 |
in the context of the history of money is fascinating. 00:01:47.160 |
and Bitcoin, the first decentralized cryptocurrency, 00:01:53.480 |
cryptocurrency is still very much in its early days 00:01:58.200 |
and just might, redefine the nature of money. 00:02:03.520 |
from the App Store or Google Play and use the code LEXPODCAST, 00:02:08.000 |
you get $10, and Cash App will also donate $10 to FIRST, 00:02:12.440 |
an organization that is helping advance robotics 00:02:14.820 |
and STEM education for young people around the world. 00:02:17.620 |
And now, here's my conversation with Ilya Sutskever. 00:02:39.560 |
what was your intuition about neural networks, 00:02:42.240 |
about the representational power of neural networks? 00:03:06.700 |
At some point, we realized that we can train very large, 00:03:18.540 |
At some point, different people obtained this result. 00:03:36.300 |
end to end, without pre-training, from scratch. 00:03:40.620 |
And when that happened, I thought, this is it. 00:03:43.940 |
Because if you can train a big neural network, 00:03:45.620 |
a big neural network can represent a very complicated function. 00:03:49.500 |
Because if you have a neural network with 10 layers, 00:04:13.100 |
that we need to train a very big neural network 00:04:16.100 |
on lots of supervised data, and then it must succeed, 00:04:25.760 |
Today, we know that actually this theory is very incomplete 00:04:30.380 |
But definitely, if you have more data than parameters, 00:04:34.700 |
were heavily over-parameterized wasn't discouraging to you? 00:04:43.080 |
the fact there's a huge number of parameters is okay? 00:04:48.260 |
but the theory was that if you had a big dataset 00:04:53.060 |
The over-parameterization just didn't really figure much 00:05:04.460 |
will we have enough compute to train a big enough neural net? 00:05:23.460 |
- Was most of your intuition from empirical results 00:05:34.700 |
Or was there some pen and paper or marker and whiteboard 00:05:40.760 |
'Cause you just connected a 10-layer large neural network 00:05:44.740 |
to the brain, so you just mentioned the brain. 00:05:49.220 |
does the human brain come into play as an intuition builder? 00:05:55.020 |
I mean, you gotta be precise with these analogies 00:05:57.520 |
between artificial neural networks and the brain. 00:06:00.300 |
But there's no question that the brain is a huge source 00:06:04.100 |
of intuition and inspiration for deep learning researchers 00:06:07.460 |
since all the way from Rosenblatt in the '60s. 00:06:10.820 |
Like, if you look at, the whole idea of a neural network 00:06:15.740 |
You had people like McCulloch and Pitts who were saying, 00:06:21.980 |
"And hey, we recently learned about the computer 00:06:36.020 |
Then you had the convolutional neural network 00:06:39.940 |
who said, "Hey, if you limit the receptive fields 00:06:42.020 |
"of a neural network, it's gonna be especially suitable 00:06:49.980 |
where analogies to the brain were successful. 00:06:52.380 |
And I thought, well, probably an artificial neuron 00:07:12.100 |
between the human brain, now I know you're probably 00:07:19.780 |
what's the difference between the human brain 00:07:21.260 |
and artificial neural networks that's interesting to you 00:07:27.380 |
What is an interesting difference between the brain 00:07:32.940 |
So I feel like today, artificial neural networks, 00:07:37.140 |
so we all agree that there are certain dimensions 00:07:39.420 |
in which the human brain vastly outperforms our models. 00:07:46.220 |
have a number of very important advantages over the brain. 00:07:50.200 |
Looking at the advantages versus disadvantages 00:07:52.580 |
is a good way to figure out what is the important difference. 00:07:55.640 |
So the brain uses spikes, which may or may not be important. 00:08:08.380 |
- It's hard to tell, but my prior is not very high 00:08:16.500 |
what they figured out is that they need to simulate 00:08:24.300 |
If you don't simulate the non-spiking neural networks 00:08:29.540 |
And that connects to questions around back propagation 00:08:40.460 |
It's not a self-evident question, especially if you, 00:08:45.860 |
let's say if you were just starting in the field 00:08:53.740 |
That's a great idea because the brain is a neural network, 00:08:55.900 |
so it would be useful to build neural networks. 00:09:00.420 |
It should be possible to train them probably, but how? 00:09:08.780 |
The cost function is a way of measuring the performance 00:09:14.920 |
By the way, that is a big, actually, let me think. 00:09:28.900 |
Is supervised learning a difficult concept to come to? 00:09:36.460 |
- Yeah, that's what, it seems trivial now, but I, 00:09:38.940 |
'cause the reason I ask that, and we'll talk about it, 00:09:43.460 |
Is there things that don't necessarily have a cost function, 00:09:50.900 |
or maybe a totally different kind of architectures? 00:09:57.980 |
- So the only, so the good examples of things 00:09:59.940 |
which don't have clear cost functions are GANs. 00:10:09.260 |
where you know that you have an algorithm gradient descent, 00:10:13.940 |
and then you can reason about the behavior of your system 00:10:20.020 |
"and I'll reason about the behavior of the system 00:10:24.540 |
But it's all about coming up with these mathematical objects 00:10:26.540 |
that help us reason about the behavior of our system. 00:10:31.180 |
Yeah, so GAN is the only one, it's kind of a, 00:10:33.420 |
the cost function is emergent from the comparison. 00:10:36.900 |
- It's, I don't know if it has a cost function. 00:10:41.360 |
It's kind of like the cost function of biological evolution 00:10:49.460 |
to which it will go towards, but I don't think, 00:10:53.780 |
I don't think the cost function analogy is the most useful. 00:10:57.500 |
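To make the GAN point concrete: a rough sketch of the usual GAN training loop, where there is no single scalar cost that both networks descend; the generator and discriminator optimize coupled, opposing objectives. This is a minimal toy on made-up 1-D data, purely illustrative.

```python
import torch

# Toy 1-D GAN: two coupled objectives trained against each other,
# rather than one cost function that everything minimizes.
G = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
D = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0            # "data": samples from N(2, 0.5)
    fake = G(torch.randn(64, 8))                     # generator's samples

    # discriminator: tell real from fake
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator: make fakes look real to the current discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The "cost" here is emergent from the competition: each network's loss depends on the other's current parameters.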
- So if evolution doesn't, that's really interesting. 00:11:00.140 |
So if evolution doesn't really have a cost function, 00:11:04.940 |
something akin to our mathematical conception 00:11:09.900 |
of a cost function, then do you think cost functions 00:11:15.180 |
Yeah, so you just kind of mentioned that cost function 00:11:26.780 |
So self-play starts to touch on that a little bit 00:11:50.380 |
yet another profound way of looking at things 00:11:52.700 |
that will involve cost functions in a less central way. 00:11:55.620 |
But I don't know, I think cost functions are, I mean, 00:12:05.500 |
that pop into your mind that might be different 00:12:16.220 |
- I mean, one thing which may potentially be useful, 00:12:18.620 |
I think people, neuroscientists have figured out 00:12:20.580 |
something about the learning rule of the brain, 00:12:22.220 |
or I'm talking about spike-timing-dependent plasticity, 00:12:28.420 |
- Wait, sorry, spike-timing-dependent plasticity? 00:12:39.660 |
So it's kind of like, if a synapse fires into the neuron 00:12:42.580 |
before the neuron fires, then it strengthens the synapse. 00:12:48.020 |
shortly after the neuron fired, then it weakens the synapse. 00:12:52.220 |
I'm 90% sure it's right, so if I said something wrong here, 00:13:01.060 |
But the timing, that's one thing that's missing. 00:13:07.460 |
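For reference, a minimal sketch of the rule being described, in the textbook form of spike-timing-dependent plasticity with illustrative constants:

```python
import numpy as np

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Illustrative STDP rule (textbook form, constants made up).

    If the presynaptic spike arrives shortly *before* the postsynaptic spike
    (dt > 0), the synapse is strengthened; if it arrives shortly *after*
    (dt < 0), it is weakened. The effect decays exponentially with the gap.
    """
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * np.exp(-dt / tau)    # pre before post -> potentiation
    return -a_minus * np.exp(dt / tau)       # pre after post -> depression

print(stdp_delta_w(t_pre=10.0, t_post=15.0))   # small positive weight change
print(stdp_delta_w(t_pre=15.0, t_post=10.0))   # small negative weight change
```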
I think that's like a fundamental property of the brain, 00:13:22.300 |
There's a clock, I guess, to recurrent neural networks. 00:13:30.100 |
the continuous version of that, the generalization, 00:13:36.060 |
and then within those timings is contained some information. 00:13:48.860 |
as the timing that seems to be important for the brain, 00:13:56.300 |
- I mean, I think recurrent neural networks are amazing, 00:14:00.700 |
and they can do, I think they can do anything 00:14:21.320 |
that we'll talk about on natural language processing 00:14:24.420 |
and language modeling has been with transformers 00:14:29.980 |
Do you think recurrence will make a comeback? 00:14:33.260 |
- Well, some kind of recurrence, I think, very likely. 00:14:38.700 |
as they're typically thought of for processing sequences, 00:14:44.420 |
- What is, to you, a recurrent neural network? 00:15:05.660 |
that's what, like, expert systems did, right? 00:15:12.380 |
growing a knowledge base is maintaining a hidden state, 00:15:18.460 |
and is growing it by sequentially processing. 00:15:20.300 |
Do you think of it more generally in that way? 00:15:22.700 |
Or is it simply, is it the more constrained form 00:15:27.700 |
of a hidden state with certain kind of gating units 00:15:31.340 |
that we think of as today with LSTMs and that? 00:15:37.860 |
that goes inside the LSTM or the RNN or something like this. 00:15:43.260 |
if you want to make the expert system analogy, I'm not, 00:16:19.340 |
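A minimal sketch of what "maintaining a hidden state and growing it by sequential processing" looks like in code, for a vanilla RNN cell (an LSTM adds gating on top of the same idea; sizes here are arbitrary):

```python
import numpy as np

# The hidden state is the running "knowledge" that gets updated
# as each new input in the sequence is processed.
rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_step(h, x):
    # the new state mixes the old state with the new observation
    return np.tanh(W_hh @ h + W_xh @ x + b)

h = np.zeros(d_hidden)
for x in rng.normal(size=(10, d_in)):   # a sequence of 10 inputs
    h = rnn_step(h, x)                  # the state accumulates information
print(h)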
Neural networks have been around for many decades, 00:16:24.220 |
that led to their success, that ImageNet moment, 00:16:35.540 |
the key ideas that led to the success of deep learning 00:16:42.900 |
behind deep learning has been around for much longer. 00:16:53.940 |
before deep learning started to be successful, 00:17:02.900 |
simply didn't think that neural networks could do much. 00:17:06.300 |
People didn't believe that large neural networks 00:17:10.580 |
People thought that, well, there was a lot of debate 00:17:34.060 |
That's when this field becomes a little bit more 00:17:43.540 |
The thing that was missing was a lot of supervised data 00:17:47.940 |
Once you have a lot of supervised data and a lot of compute, 00:17:52.620 |
then there is a third thing which is needed as well, 00:18:25.180 |
allowed the empirical evidence to do the convincing 00:18:29.660 |
of the majority of the computer science community. 00:18:48.260 |
and I think ImageNet served as that moment. 00:18:52.940 |
where the big pillars of computer vision community 00:19:01.500 |
And it's not enough for the ideas to all be there 00:19:06.300 |
it also has to overcome the cynicism that existed. 00:19:10.460 |
It's interesting that people just didn't believe 00:19:27.540 |
because neural networks really did not work on anything. 00:19:37.980 |
And that's why you need to have these very hard tasks 00:19:46.940 |
And that's why the field is making progress today 00:19:52.780 |
And this is why we are able to avoid endless debate. 00:20:03.060 |
in computer vision, language, natural language processing, 00:20:07.060 |
reinforcement learning, sort of everything in between. 00:20:12.540 |
There may not be a topic you haven't touched. 00:20:16.260 |
And of course, the fundamental science of deep learning. 00:20:19.660 |
What is the difference to you between vision, language, 00:20:39.660 |
Machine learning is a field with a lot of unity, 00:20:50.180 |
In fact, there's only one or two or three principles 00:20:57.380 |
in almost the same way to the different modalities 00:21:01.380 |
And that's why today, when someone writes a paper 00:21:04.140 |
on improving optimization of deep learning and vision, 00:21:12.340 |
Reinforcement learning, so I would say that computer vision 00:21:23.900 |
and we use convolutional neural networks in vision. 00:21:26.500 |
But it's also possible that one day this will change 00:21:28.900 |
and everything will be unified with a single architecture. 00:21:39.340 |
for every different tiny problem had its own architecture. 00:21:55.900 |
and sub, you know, little set of collection of skills, 00:21:58.620 |
people who would know how to engineer the features. 00:22:08.500 |
Or rather, I shouldn't say expect, I think it's possible. 00:22:16.820 |
RL does require slightly different techniques 00:22:20.780 |
You really do need to do something about exploration. 00:22:26.020 |
But I think there is a lot of unity even there. 00:22:31.140 |
broader unification between RL and supervised learning, 00:22:35.220 |
where somehow the RL will be making decisions 00:22:43.260 |
you shovel things into it and it just figures out 00:22:48.020 |
- I mean, reinforcement learning has some aspects 00:22:57.740 |
that you should be utilizing and there's elements 00:23:03.060 |
So it seems like the, it's like the union of the two 00:23:09.980 |
I'd say that reinforcement learning is neither, 00:23:17.360 |
- You think action is fundamentally different? 00:23:21.300 |
what is unique about policy of learning to act? 00:23:29.800 |
you are fundamentally in a non-stationary world 00:23:41.300 |
And this is not the case for the more traditional 00:23:44.140 |
static problem where you have some distribution 00:23:46.300 |
and you just apply a model to that distribution. 00:23:48.600 |
- You think it's a fundamentally different problem 00:23:53.900 |
it's a generalization of the problem of understanding? 00:23:56.980 |
- I mean, it's a question of definitions almost. 00:24:00.600 |
there's a huge amount of commonality for sure. 00:24:01.940 |
You take gradients, you try, you take gradients, 00:24:04.100 |
we try to approximate gradients in both cases. 00:24:06.100 |
In some, in the case of reinforcement learning, 00:24:16.260 |
You compute the gradient, you apply Adam in both cases. 00:24:28.900 |
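A rough sketch of that shared machinery: a supervised step and a REINFORCE-style reinforcement learning step both end in the same backprop-plus-Adam update; they differ only in where the learning signal comes from. Sizes and data below are made up for illustration.

```python
import torch

policy = torch.nn.Linear(4, 3)            # toy model/policy: 4-dim observation -> 3 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Supervised step: the target is given, so the loss gradient is exact.
obs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
loss = torch.nn.functional.cross_entropy(policy(obs), labels)
opt.zero_grad(); loss.backward(); opt.step()

# RL step (REINFORCE): no targets, only sampled actions and rewards,
# so the gradient of expected reward is only *estimated*.
obs = torch.randn(8, 4)
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
rewards = torch.randn(8)                  # stand-in for environment feedback
loss = -(dist.log_prob(actions) * rewards).mean()
opt.zero_grad(); loss.backward(); opt.step()   # same update machinery either way
```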
It's really just a matter of your point of view, 00:24:48.060 |
is harder than visual scene understanding or vice versa? 00:24:59.460 |
- So what does it mean for a problem to be hard? 00:25:02.620 |
Okay, the non-interesting, dumb answer to that 00:25:10.700 |
and there's a human level performance on that benchmark. 00:25:20.620 |
until we get to human level on a very good benchmark. 00:25:28.900 |
So what I was going to say that a lot of it depends on, 00:25:32.060 |
you know, once you solve a problem, it stops being hard. 00:25:39.740 |
So, you know, you say today, true human level, 00:25:48.900 |
of solving the problem completely in the next three months. 00:25:55.420 |
my guess would be as good as yours, I don't know. 00:25:57.700 |
- Okay, so you don't have a fundamental intuition 00:26:04.300 |
I'd say language is probably going to be harder. 00:26:11.220 |
100% language understanding, I'll go with language. 00:26:17.980 |
with letters on it, is that, you see what I mean? 00:26:22.620 |
you say it's the best human level vision system. 00:26:25.100 |
I show you, I open a book and I show you letters. 00:26:36.100 |
- Yeah, so Chomsky would say it starts at language. 00:26:39.860 |
of the kind of structure and fundamental hierarchy 00:26:44.860 |
of ideas that's already represented in our brain somehow 00:26:51.380 |
But where does vision stop and language begin? 00:27:15.580 |
without basically using the same kind of system. 00:27:25.380 |
is probably that good that we can get the other. 00:27:30.180 |
And also, I think a lot of it really does depend 00:27:40.020 |
Because reading is vision, but should it count? 00:28:09.740 |
- Well, the ones, okay, so I'm a fan of monogamy, 00:28:20.020 |
it's possible to have somebody continuously giving you 00:28:22.980 |
pleasurable, interesting, witty, new ideas, friends. 00:28:32.100 |
- The surprise, it's that injection of randomness 00:28:42.940 |
continued inspiration, like the wit, the humor. 00:28:55.020 |
but I think if you have enough humans in the room. 00:29:03.100 |
I thought you meant to impress you with its intelligence, 00:29:13.300 |
and it's gonna get it right, and you're gonna say, wow, 00:29:15.980 |
Our systems of January 2020 have not been doing that. 00:29:22.300 |
like the reason people click like on stuff on the internet, 00:29:40.500 |
what is the most beautiful or surprising idea 00:29:43.180 |
in deep learning or AI in general you've come across? 00:29:46.860 |
- So I think the most beautiful thing about deep learning 00:29:57.700 |
And then you got some theories as to, you know, 00:30:05.940 |
then it will do the same function that the brain does. 00:30:14.180 |
and you make them larger and they keep getting better. 00:30:17.900 |
I find it unbelievable that this whole AI stuff 00:30:24.980 |
are there little bits and pieces of intuitions, 00:30:54.820 |
sort of empirical evidence kind of convinces you, 00:31:00.380 |
It shows you that, look, this evolutionary process 00:31:08.280 |
But it doesn't really get you to the insights 00:31:23.980 |
You know, you got around the experiment, it's important. 00:31:35.020 |
You say, yeah, let's make a big neural network. 00:31:37.420 |
And it's going to work much better than anything before it. 00:31:43.620 |
That's amazing when a theory is validated like this. 00:32:10.460 |
just to find the set of what biology represents. 00:32:21.020 |
it's really hard to have good predictive theory. 00:32:25.380 |
In physics, people make these super precise theories 00:32:29.340 |
And in machine learning, we're kind of in between. 00:32:33.820 |
if machine learning somehow helped us discover 00:33:04.060 |
from the past 10 years, I would say most of it, 00:33:08.900 |
where things that felt like really new ideas showed up. 00:33:12.900 |
But by and large, it was every year we thought, 00:33:19.020 |
And then the next year, okay, now this is big deep learning. 00:33:31.420 |
- Do you think it's getting harder and harder 00:33:46.140 |
it can be harder because there is a very large number 00:33:51.820 |
then you can make a lot of very interesting discoveries, 00:33:57.460 |
of managing a huge compute cluster to run your experiments. 00:34:06.460 |
but you're one of the smartest people I know, 00:34:09.500 |
So let's imagine all the breakthroughs that happen 00:34:15.260 |
Do you think most of those breakthroughs can be done 00:34:23.780 |
do you think compute and large efforts will be necessary? 00:34:33.900 |
When you say one computer, you mean how large? 00:34:51.020 |
The stack of deep learning is starting to be quite deep. 00:34:53.780 |
If you look at it, you've got all the way from the ideas, 00:35:04.180 |
the building the actual cluster, the GPU programming, 00:35:10.580 |
and I think it can be quite hard for a single person 00:35:17.900 |
- What about what like Vladimir Vapnik really insists on 00:35:22.100 |
is taking MNIST and trying to learn from very few examples. 00:35:29.060 |
Do you think there'll be breakthroughs in that space 00:35:34.860 |
- I think there will be a large number of breakthroughs 00:35:37.900 |
in general that will not need a huge amount of compute. 00:35:42.100 |
I think that some breakthroughs will require a lot of compute 00:35:45.380 |
and I think building systems which actually do things 00:35:51.340 |
If you want to do X and X requires a huge neural net, 00:35:59.340 |
I think there is lots of room for very important work 00:36:05.140 |
- Can you maybe sort of on the topic of the science 00:36:08.420 |
of deep learning, talk about one of the recent papers 00:36:23.580 |
So what happened is that some, over the years, 00:36:27.020 |
some small number of researchers noticed that 00:36:32.180 |
and it seems to go in contradiction with statistical ideas. 00:36:34.660 |
And then some people made an analysis showing 00:36:36.940 |
that actually you got this double descent bump. 00:36:38.940 |
And what we've done was to show that double descent occurs 00:36:42.780 |
for pretty much all practical deep learning systems. 00:36:46.420 |
And that it'll be also, so can you step back? 00:36:49.940 |
What's the X axis and the Y axis of a double descent plot? 00:37:10.020 |
So if you increase the size of the neural network slowly, 00:37:19.020 |
Then when the neural network is really small, 00:37:23.580 |
you get a very rapid increase in performance. 00:37:27.300 |
And at some point performance will get worse. 00:37:38.660 |
And then as you make it larger, it starts to get better again. 00:37:50.020 |
but it also occurs in the case of linear classifiers. 00:37:53.140 |
And the intuition basically boils down to the following. 00:37:55.940 |
When you have a large dataset and a small model, 00:38:02.020 |
then small, tiny random, so basically what is overfitting? 00:38:07.100 |
Overfitting is when your model is somehow very sensitive 00:38:11.980 |
to the small, random, unimportant stuff in your dataset. 00:38:18.980 |
So if you have a small model and you have a big dataset, 00:38:24.780 |
some training cases are randomly in the dataset 00:38:31.660 |
to this randomness because there is pretty much 00:38:43.340 |
that neural networks don't overfit every time, 00:38:48.340 |
very quickly, before ever being able to learn anything. 00:38:57.660 |
so maybe, so let me try to give the explanation 00:39:15.540 |
where your neural network achieves zero error. 00:39:18.060 |
And SGD is going to find approximately the point-- 00:39:22.620 |
- Approximately the point with the smallest norm 00:39:27.540 |
- And that can also be proven to be insensitive 00:39:48.860 |
So this is the best explanation, more or less. 00:39:54.020 |
to have more parameters, so to be bigger than the data. 00:39:58.660 |
- That's right, but only if you don't early stop. 00:40:00.860 |
If you introduce early stop in your regularization, 00:40:30.780 |
- Do you have any intuition why this happens? 00:40:41.260 |
that when the dataset has as many degrees of freedom 00:40:45.660 |
as the model, then there is a one-to-one correspondence 00:40:49.100 |
between them and so small changes to the dataset 00:40:55.100 |
So your model is very sensitive to all the randomness. 00:41:16.500 |
- Exactly, the spurious correlation which you don't want. 00:41:20.580 |
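A small, self-contained illustration of the double descent shape discussed above, using minimum-norm least squares on random ReLU features. This is a standard toy setting, not the paper's actual experiments; the test error typically peaks when the number of features is near the number of training points and then falls again as the model grows past that point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 10
w_true = rng.normal(size=d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)   # noisy training labels
y_te = X_te @ w_true

V = rng.normal(size=(d, 400))                      # fixed random projection
feats = lambda X, p: np.maximum(X @ V[:, :p], 0)   # first p random ReLU features

for p in [5, 10, 20, 40, 80, 200, 400]:            # "model size" sweep
    F_tr, F_te = feats(X_tr, p), feats(X_te, p)
    beta = np.linalg.pinv(F_tr) @ y_tr              # minimum-norm fit (what SGD finds)
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"features={p:4d}  test MSE={err:8.3f}")  # bump typically near p ~ n_train
```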
- Jeff Hinton suggested we need to throw back propagation. 00:41:23.540 |
We already kind of talked about this a little bit, 00:41:29.820 |
I mean, of course, some of that is a little bit 00:41:39.640 |
- Well, the thing that he said precisely is that 00:41:42.180 |
to the extent that you can't find back propagation 00:41:44.100 |
in the brain, it's worth seeing if we can learn something 00:41:47.680 |
from how the brain learns, but back propagation 00:41:58.140 |
we should also try to implement that in neural networks? 00:42:00.660 |
- If it turns out that we can't find back propagation 00:42:03.780 |
- If we can't find back propagation in the brain. 00:42:16.020 |
- I mean, I personally am a big fan of back propagation. 00:42:18.460 |
I think it's a great algorithm because it solves 00:42:23.100 |
finding a neural circuit subject to some constraints. 00:42:30.440 |
so that's why I really, I think it's pretty unlikely 00:42:35.000 |
that we'll have anything which is going to be 00:42:38.680 |
It could happen, but I wouldn't bet on it right now. 00:42:41.420 |
- So let me ask a sort of big picture question. 00:42:46.840 |
Do you think neural networks can be made to reason? 00:42:53.380 |
- Well, if you look, for example, at AlphaGo or AlphaZero, 00:43:01.740 |
which we all agree is a game that requires reasoning, 00:43:34.040 |
But yes, I think it has some of the same elements 00:43:43.180 |
There's a sequential element of step-wise consideration 00:43:49.180 |
of possibilities, and sort of building on top 00:43:53.320 |
of those possibilities in a sequential manner 00:43:57.640 |
So yeah, I guess playing Go is kind of like that. 00:44:00.520 |
And when you have a single neural network doing that 00:44:04.920 |
So there's an existence proof in a particular 00:44:06.760 |
constrained environment that a process akin to 00:44:33.400 |
will look similar to the neural network architectures 00:44:50.240 |
will be very similar to the architectures that exist today. 00:44:57.100 |
But these neural nets are so insanely powerful. 00:45:02.100 |
Why wouldn't they be able to learn to reason? 00:45:05.560 |
Humans can reason, so why can't neural networks? 00:45:11.640 |
neural networks do is a kind of just weak reasoning? 00:45:14.660 |
So it's not a fundamentally different process? 00:45:16.600 |
Again, this is stuff nobody knows the answer to. 00:45:30.560 |
which doesn't require reasoning, it's not going to reason. 00:45:34.020 |
This is a well-known effect where the neural network 00:45:36.360 |
will solve exactly the, it will solve the problem 00:45:39.320 |
that you pose in front of it in the easiest way possible. 00:46:09.200 |
- Yeah, so the thing which I said precisely was that 00:46:29.240 |
Now, you can also prove mathematically that it is, 00:46:33.920 |
which generates some data is not a computable operation. 00:46:52.880 |
the shortest program which generates our data, 00:47:01.620 |
even a large circuit which fits our data in some way. 00:47:05.320 |
- Well, I think what you meant by the small circuit 00:47:12.360 |
back then I really haven't fully internalized 00:47:17.080 |
The things we know about over-parameterized neural nets, 00:47:23.200 |
whose weights contain a small amount of information, 00:47:29.200 |
If you imagine the training process of a neural network 00:47:37.080 |
then somehow the amount of information in the weights 00:47:42.960 |
which would explain why they generalize so well. 00:47:45.240 |
- So that's, the large circuit might be one that's helpful 00:47:53.360 |
- But do you see it important to be able to try 00:48:04.880 |
I think it's kind of, the answer is kind of yes, 00:48:11.200 |
It's the reason we are pushing on deep learning, 00:48:23.920 |
We've got our pillar, which is the training pillar. 00:48:27.560 |
And now we are trying to contort our neural networks 00:48:36.440 |
And so being trainable means starting from scratch, 00:48:40.600 |
knowing nothing, you can actually pretty quickly 00:48:42.880 |
converge towards knowing a lot or even slowly. 00:48:45.920 |
But it means that given the resources at your disposal, 00:48:55.440 |
- Yeah, that's a pillar we can't move away from. 00:48:58.520 |
and whereas if you say, hey, let's find the shortest program, 00:49:02.840 |
So it doesn't matter how useful that would be. 00:49:09.920 |
that neural networks are good at finding small circuits 00:49:13.420 |
Do you think then the matter of finding small programs 00:49:34.600 |
of people successfully finding programs really well. 00:49:40.680 |
is you'd train a deep neural network to do it basically. 00:49:48.160 |
- But there's not good illustrations of that. 00:49:59.880 |
And put another way, you don't see why it's not possible. 00:50:04.200 |
- Well, it's kind of like more, it's more a statement of, 00:50:07.920 |
I think that it's unwise to bet against deep learning. 00:50:18.720 |
then it doesn't take too long for some deep neural net 00:50:27.840 |
I've stopped betting against neural networks at this point 00:50:42.200 |
So being able to aggregate important information 00:50:45.520 |
over long periods of time that would then serve 00:51:01.600 |
- So in some sense, the parameters already do that. 00:51:04.840 |
The parameters are an aggregation of the day, 00:51:07.920 |
of the neural, of the entirety of the neural experience. 00:51:10.920 |
And so they count as the long, as long-term knowledge. 00:51:21.520 |
people have investigated language models as knowledge basis. 00:51:27.320 |
- Yeah, but in some sense, do you think in every sense, 00:51:29.880 |
do you think there's a, it's all just a matter 00:51:40.240 |
'Cause right now, I mean, there's not been mechanisms 00:51:43.080 |
that do remember really long-term information. 00:51:51.760 |
So I'm thinking of the kind of compression of information 00:51:58.160 |
the knowledge bases represent, sort of creating a, 00:52:02.960 |
now, I apologize for my sort of human-centric thinking 00:52:06.920 |
about what knowledge is, 'cause neural networks 00:52:12.920 |
with the kind of knowledge they have discovered. 00:52:15.800 |
But a good example for me is knowledge bases, 00:52:18.740 |
being able to build up over time something like 00:52:30.840 |
Obviously not the actual Wikipedia or the language, 00:52:37.920 |
So it's a really nice compressed knowledge base, 00:52:40.360 |
or something akin to that in a non-interpretable sense 00:52:46.980 |
- Well, the neural networks would be non-interpretable 00:52:49.440 |
but their outputs should be very interpretable. 00:52:52.200 |
- Okay, so yeah, how do you make very smart neural networks 00:53:00.280 |
and the text will generally be interpretable. 00:53:02.120 |
- Do you find that the epitome of interpretability, 00:53:12.240 |
I would like the neural network to come up with examples 00:53:17.960 |
and examples where it's completely brilliant. 00:53:22.280 |
is to generate a lot of examples and use my human judgment. 00:53:52.280 |
there are actually two answers to that question. 00:53:54.360 |
One answer is, you know, we have the neural net, 00:53:58.520 |
and we can try to understand what the different neurons 00:54:11.400 |
where you say, you know, you look at a human being, 00:54:16.520 |
how do you know what a human being is thinking? 00:54:18.800 |
You ask them, you say, hey, what do you think about this? 00:54:25.640 |
in the sense you already have a mental model. 00:54:41.560 |
and then everything you ask, you're adding onto that. 00:54:49.800 |
that's one of the really interesting qualities 00:54:51.720 |
of the human being, is that information is sticky. 00:54:55.040 |
You don't, you seem to remember the useful stuff, 00:54:57.560 |
aggregate it well, and forget most of the information 00:55:06.800 |
It's just that neural networks are much crappier 00:55:10.680 |
It doesn't seem to be fundamentally that different. 00:55:13.280 |
But just to stick on reasoning for a little longer, 00:55:39.320 |
Solving open-ended problems with out-of-the-box solutions. 00:55:43.160 |
- And sort of theorem type mathematical problems. 00:55:49.520 |
- Yeah, I think those ones are a very natural example 00:55:56.620 |
And so by the way, and this comes back to the point 00:56:03.240 |
machine learning, deep learning as a field is very fortunate 00:56:06.120 |
because we have the ability to sometimes produce 00:56:14.320 |
We have the ability to produce conversation changing results. 00:56:19.540 |
- Conversation, and then of course, just like you said, 00:56:28.420 |
Yeah, that whole mortality thing is kind of a sticky problem 00:56:47.180 |
Can you briefly kind of try to describe the recent history 00:56:51.140 |
of using neural networks in the domain of language and text? 00:57:00.260 |
tiny recurrent neural network applied to language 00:57:03.900 |
So the history is really, you know, fairly long at least. 00:57:17.220 |
of all deep learning, and that's data and compute. 00:57:19.700 |
So suddenly you move from small language models, 00:57:31.660 |
because they're trying to predict the next word. 00:57:44.860 |
and there is a space between those characters. 00:57:47.980 |
And you'll notice that sometimes there is a comma 00:57:50.020 |
and then the next character is a capital letter. 00:58:14.060 |
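The simplest possible version of "predict the next character and pick up local regularities" is just a bigram counter; a toy sketch:

```python
from collections import Counter, defaultdict

# Count how often each character follows each other character,
# then predict the most likely continuation.
text = "hello world. hello there. hello again."
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def predict_next(ch):
    return counts[ch].most_common(1)[0][0] if counts[ch] else " "

print(predict_next("h"))   # 'e' -- it has picked up a local regularity
print(predict_next("."))   # ' ' -- after a period comes a space
```

The progression being described is that same next-token prediction objective, scaled from counts like these up to LSTMs and then transformers with vastly more data and compute.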
'cause that's where you and Noam Chomsky disagree. 00:58:16.620 |
So you think we're actually taking incremental steps, 00:58:58.820 |
know precisely what Chomsky means when he talks about him. 00:59:12.740 |
when you inspect those larger language models, 00:59:14.700 |
they exhibit signs of understanding the semantics, 00:59:28.620 |
And we noticed that when you increase the size of the LSTM 00:59:35.420 |
then one of the neurons starts to represent the sentiment 00:59:49.460 |
but sentiment is whether it's a positive or negative review. 00:59:58.780 |
that a small neural net does not capture sentiment 01:00:11.060 |
- And with size, you quickly run out of syntax to model, 01:00:15.820 |
and then you really start to focus on the semantics, 01:00:28.260 |
of semantic understanding, partial semantic understanding, 01:00:30.780 |
but the smaller models do not show those signs. 01:00:34.540 |
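A hedged sketch of the kind of probe behind the "sentiment neuron" observation: given hidden states from a next-character model run over reviews (simulated here with random data plus one planted unit), look for the single unit whose activation best tracks whether a review is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews, n_units = 200, 64
H = rng.normal(size=(n_reviews, n_units))     # stand-in for per-review hidden states
y = rng.integers(0, 2, size=n_reviews)        # 1 = positive review (made-up labels)
H[:, 17] += 2.0 * (y - 0.5)                   # plant a "sentiment unit" for the demo

corr = [abs(np.corrcoef(H[:, j], y)[0, 1]) for j in range(n_units)]
best = int(np.argmax(corr))
print(f"most sentiment-correlated unit: {best}, |corr| = {corr[best]:.2f}")
```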
- Can you take a step back and say, what is GPT-2, 01:00:50.380 |
that was trained on about 40 billion tokens of text, 01:01:03.940 |
- The transformer, it's the most important advance 01:01:06.740 |
in neural network architectures in recent history. 01:01:13.300 |
not necessarily sort of technically speaking, 01:01:17.500 |
versus maybe what recurring neural networks represent. 01:01:23.380 |
is a combination of multiple ideas simultaneously 01:01:59.400 |
The second thing is that transformer is not recurrent. 01:02:15.360 |
so therefore less deep and easier to optimize. 01:02:17.840 |
And the combination of those factors make it successful. 01:02:31.080 |
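The attention piece of that combination, in its simplest single-head form; real transformers add learned query/key/value projections, multiple heads, and feed-forward layers on top of this core operation.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over a sequence.

    Every position attends to every other position in parallel, so there
    is no recurrence over time; the whole sequence is processed at once.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ X                               # mix values by attention weight

X = np.random.default_rng(0).normal(size=(6, 8))     # 6 tokens, 8-dim embeddings
print(self_attention(X).shape)                       # (6, 8)
```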
- Were you surprised how well transformers worked 01:02:42.880 |
So you got to see the whole set of revolutions 01:03:02.480 |
It was just amazing to see generate this text of this. 01:03:07.360 |
that at that time, you've seen all this progress in GANs, 01:03:10.480 |
in improving the samples produced by GANs were just amazing. 01:03:31.840 |
But then to see it with your own eyes, it's something else. 01:03:37.240 |
And now there's sort of some cognitive scientists 01:03:51.880 |
the fact that they're able to model the language so well is. 01:04:03.720 |
- Do you think that bar will continuously be moved? 01:04:08.840 |
really dramatic economic impact, that's when, 01:04:11.960 |
I think that's in some sense the next barrier. 01:04:13.800 |
Because right now, if you think about the work in AI, 01:04:18.880 |
It's really hard to know what to make of all these advances. 01:04:30.400 |
At some point, I think people who are outside of AI, 01:04:35.400 |
they can no longer distinguish this progress anymore. 01:04:41.760 |
and how there's a lot of brilliant work in Russian 01:04:44.120 |
that the rest of the world doesn't know about. 01:04:53.880 |
where we're going to see sort of economic big impact? 01:05:01.600 |
I want to point out that translation already today is huge. 01:05:16.400 |
I think self-driving is going to be hugely impactful. 01:05:20.320 |
And that's, you know, it's unknown exactly when it happens, 01:05:24.480 |
but again, I would not bet against deep learning. 01:05:28.000 |
- So that's deep learning in general, but you think-- 01:05:33.160 |
But I was talking about sort of language models. 01:05:36.200 |
- Just to check. - I veered off a little bit. 01:05:54.200 |
that can take on both language and vision tasks. 01:06:01.440 |
Now let's see, what can I ask about GPT-2 more? 01:06:06.980 |
It's, you take a transform, you make it bigger, 01:06:10.800 |
and suddenly it does all those amazing things. 01:06:12.700 |
- Yeah, one of the beautiful things is that GPT, 01:06:28.240 |
- Sort of like what are the next steps with GPT-2, 01:06:31.480 |
- I mean, I think for sure seeing what larger versions 01:06:41.240 |
There's one question which I'm curious about, 01:06:45.400 |
so we feed it all this data from the internet, 01:06:48.160 |
all those random facts about everything in the internet. 01:07:04.420 |
people don't learn all data indiscriminately. 01:07:21.200 |
can you just elaborate that a little bit more? 01:07:29.920 |
that the optimization of how you select data, 01:07:35.920 |
is going to be a place for a lot of breakthroughs, 01:07:42.200 |
because there hasn't been many breakthroughs there 01:07:45.160 |
I feel like there might be private breakthroughs 01:07:49.400 |
'cause it's a fundamental problem that has to be solved 01:07:55.360 |
What do you think about the space in general? 01:07:57.880 |
- Yeah, so I think that for something like active learning, 01:08:00.280 |
or in fact, for any kind of capability, like active learning, 01:08:08.020 |
It's very hard to do research about the capability 01:08:14.280 |
is that you will come up with an artificial task, 01:08:16.760 |
get good results, but not really convince anyone. 01:08:27.520 |
some clever formulation of MNIST will convince people. 01:08:33.680 |
with a simple active learning scheme on MNIST 01:09:14.040 |
you can start to imagine that it would be used by bots 01:09:21.680 |
So there's this nervousness about what it's possible to do. 01:09:32.240 |
powerful artificial intelligence models to the public? 01:09:42.160 |
about how we manage the use of the systems and so on? 01:09:51.760 |
that you've gathered from just thinking about this, 01:10:00.680 |
is that the field of AI has been in a state of childhood, 01:10:08.720 |
What that means is that AI is very successful 01:10:14.100 |
and its impact is not only large, but it's also growing. 01:10:16.940 |
And so for that reason, it seems wise to start thinking 01:10:22.820 |
about the impact of our systems before releasing them, 01:10:29.660 |
And with the case of GPT-2, like I mentioned earlier, 01:10:38.700 |
It seemed plausible that something like GPT-2 01:10:41.540 |
could easily be used to reduce the cost of disinformation. 01:10:56.460 |
Many people use these models in lots of cool ways. 01:10:59.720 |
There've been lots of really cool applications. 01:11:02.020 |
There haven't been any negative applications we know of, 01:11:07.620 |
But also other people replicated similar models. 01:11:09.980 |
- That's an interesting question, though, that we know of. 01:11:16.060 |
is at least part of the answer to the question of how do we... 01:11:20.780 |
What do we do once we create a system like this? 01:11:29.980 |
Like, say you don't wanna release the model at all 01:11:32.400 |
because it's useful to you for whatever the business is. 01:11:35.860 |
- Well, plenty of people don't release models already. 01:11:39.940 |
but is there some moral, ethical responsibility 01:11:44.560 |
when you have a very powerful model to sort of communicate? 01:11:51.380 |
it was unclear how much it could be used for misinformation. 01:12:03.860 |
Please tell me there's some optimistic pathway 01:12:12.660 |
Or is it still really difficult from one company 01:12:21.300 |
It's definitely possible to discuss these kinds of models 01:12:38.060 |
- I think that's a place where it's important 01:12:47.960 |
which is going to be increasingly more powerful. 01:12:59.460 |
I tend to believe in the better angels of our nature, 01:13:06.860 |
That when you build a really powerful AI system 01:13:27.340 |
that would push people to close that development 01:13:51.300 |
but in general, what does it take, do you think? 01:13:56.220 |
but I think the deep learning plus maybe another small idea. 01:14:05.620 |
So like you've spoken about the powerful mechanism 01:14:11.540 |
sort of exploring the world in a competitive setting 01:14:16.660 |
against other entities that are similarly skilled as them 01:14:30.340 |
I think is going to be deep learning plus some ideas. 01:14:35.060 |
And I think self-play will be one of those ideas. 01:14:40.660 |
Self-play has this amazing property that it can surprise us 01:14:49.620 |
For example, pretty much every self-play system, 01:14:54.620 |
both our Dota bot, I don't know if OpenAI had a release 01:15:00.460 |
about multi-agent where you had two little agents 01:15:11.060 |
They all produce behaviors that we didn't expect. 01:15:18.740 |
that our systems don't exhibit routinely right now. 01:15:21.380 |
And so that's why I like this area, I like this direction 01:15:28.460 |
And an AGI system would surprise us fundamentally. 01:15:31.260 |
- Yes, and to be precise, not just a random surprise, 01:15:34.580 |
but to find a surprising solution to a problem 01:15:40.060 |
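A minimal toy of the self-play idea: an agent repeatedly improves against a frozen copy of its own current strategy, here on rock-paper-scissors with a simple multiplicative-weights update. This is purely illustrative and nothing like the Dota system.

```python
import numpy as np

payoff = np.array([[0, -1, 1],       # rock-paper-scissors payoff for player 1
                   [1,  0, -1],
                   [-1, 1,  0]])

policy = np.array([0.6, 0.3, 0.1])   # current mixed strategy
for it in range(200):
    opponent = policy.copy()         # frozen copy of ourselves
    values = payoff @ opponent       # expected payoff of each action vs. that copy
    policy = policy * np.exp(0.5 * values)   # nudge toward better actions
    policy /= policy.sum()

print(policy.round(3))               # the strategy keeps adapting to a copy of itself
```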
Now, a lot of the self-play mechanisms have been used 01:15:43.580 |
in the game context or at least in a simulation context. 01:15:56.700 |
How much faith, promise do you have in simulation 01:16:04.500 |
in the real world, whether it's the real world 01:16:17.540 |
It has certain strengths and certain weaknesses 01:16:24.460 |
That's true, but one of the criticisms of self-play, 01:16:32.740 |
one of the criticisms of reinforcement learning 01:16:50.780 |
and be able to learn in non-simulated environments? 01:16:53.420 |
Or do you think it's possible to also just simulate 01:16:57.020 |
in a photorealistic and physics realistic way, 01:17:01.100 |
the real world in a way that we can solve real problems 01:17:18.660 |
Also, OpenAI in the summer has demonstrated a robot hand 01:17:32.660 |
- I wasn't aware that was trained in simulation. 01:17:40.980 |
- No, 100% of the training was done in simulation 01:17:44.820 |
and the policy that was learned in simulation 01:17:50.940 |
it could very quickly adapt to the physical world. 01:17:53.940 |
- So the kind of perturbations with the giraffe 01:18:08.140 |
but not the kind of perturbations we've had in the video. 01:18:12.660 |
it's never been trained with a stuffed giraffe. 01:18:17.060 |
- So in theory, these are novel perturbations. 01:18:25.100 |
That's a clean, small scale, but clean example 01:18:38.180 |
And the better the transfer capabilities are, 01:18:50.220 |
which you could then carry with you to the real world. 01:18:53.420 |
As humans do all the time when they play computer games. 01:19:03.460 |
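The robot-hand result relied on heavily randomizing the simulator during training. A hedged sketch of that domain-randomization idea, where the environment constructor and training step named below are hypothetical placeholders, not real APIs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    # every episode draws a different "world", so the learned policy
    # has to work across a whole family of simulated physics
    return {
        "friction":     rng.uniform(0.5, 1.5),
        "object_mass":  rng.uniform(0.05, 0.5),   # kg
        "motor_delay":  rng.uniform(0.0, 0.04),   # seconds
        "camera_noise": rng.uniform(0.0, 0.1),
    }

for episode in range(3):
    params = sample_sim_params()
    # env = make_hand_env(**params)      # hypothetical simulator constructor
    # run_policy_and_update(env)         # hypothetical training step
    print(episode, params)
```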
Do you think AGI says that we need to have a body? 01:19:12.900 |
sort of fear of mortality, sort of self-preservation 01:19:16.580 |
in the physical space, which comes with having a body. 01:19:24.260 |
But I think it's very useful to have a body for sure, 01:19:28.820 |
you can learn things which cannot be learned without a body. 01:19:34.420 |
if you don't have a body, you could compensate for it 01:19:44.260 |
and they were able to compensate for the lack of modalities. 01:19:50.380 |
- So even if you're not able to physically interact 01:20:02.620 |
I'm not sure if it's connected to having a body or not, 01:20:07.820 |
And a more constrained version of that is self-awareness. 01:20:11.220 |
Do you think an AGI system should have consciousness? 01:20:17.300 |
whatever the heck you think consciousness is. 01:20:37.740 |
from the representation that's stored within your networks? 01:20:45.080 |
you're able to represent more and more of the world? 01:20:47.000 |
- Well, I'd say, I'd make the following argument, 01:20:53.740 |
and if you believe that artificial neural nets 01:20:59.500 |
then there should at least exist artificial neural nets 01:21:04.220 |
- You're leaning on that existence proof pretty heavily. 01:21:17.060 |
if there's not some magic in the brain that we're not, 01:21:20.760 |
I mean, I don't mean a non-materialistic magic, 01:21:23.580 |
but that the brain might be a lot more complicated 01:21:29.860 |
- If that's the case, then it should show up, 01:21:40.200 |
but let me talk about another poorly defined concept 01:21:48.140 |
what do you think is a good test of intelligence for you? 01:21:51.700 |
Are you impressed by the test that Alan Turing formulated 01:21:55.720 |
with the imitation game with natural language? 01:22:08.020 |
There's a certain frontier of capabilities today, 01:22:12.140 |
and there exist things outside of that frontier, 01:22:18.980 |
For example, I would be impressed by a deep learning system 01:22:27.300 |
like machine translation or computer vision task 01:22:33.460 |
a human wouldn't make under any circumstances. 01:22:44.940 |
they might be more accurate than human beings, 01:22:46.660 |
but they still, they make a different set of mistakes. 01:22:49.180 |
- So I would guess that a lot of the skepticism 01:22:55.820 |
is when they look at their mistakes and they say, 01:23:03.180 |
And I think that changing that would inspire me, 01:23:15.460 |
But I also just don't like that human instinct 01:23:23.180 |
when we criticize any group of creatures as the other. 01:23:33.460 |
is much smarter than human beings at many things. 01:23:44.960 |
- It's kind of hard to judge what depth means, 01:23:49.940 |
in which humans don't make mistakes that these models do. 01:23:54.500 |
- Yes, the same is applied to autonomous vehicles. 01:23:57.780 |
The same is probably gonna continue being applied 01:24:09.460 |
is the search for one case where the system fails 01:24:17.020 |
and then many people writing articles about it, 01:24:20.660 |
and then broadly as the public generally gets convinced 01:24:26.580 |
And we like pacify ourselves by thinking it's not intelligent 01:24:38.140 |
are also extremely impressed by the system that exists today. 01:24:40.860 |
But I think this connects to the earlier point we discussed 01:24:43.140 |
that it's just confusing to judge progress in AI. 01:24:47.920 |
- And you have a new robot demonstrating something. 01:24:52.760 |
And I think that people will start to be impressed 01:24:56.020 |
once AI starts to really move the needle on the GDP. 01:24:59.380 |
- So you're one of the people that might be able 01:25:02.080 |
to create an AGI system here, not you, but you and OpenAI. 01:25:09.080 |
and you get to spend sort of the evening with it, him, her, 01:25:20.040 |
- Well, the first time I would just ask all kinds 01:25:23.240 |
of questions and try to get it to make a mistake. 01:25:25.800 |
And I would be amazed that it doesn't make mistakes. 01:25:35.000 |
would they be factual or would they be personal, 01:25:57.860 |
that might be in the room where this happens. 01:26:00.480 |
So let me ask sort of a profound question about, 01:26:08.440 |
I've been talking to a lot of people who are studying power. 01:26:13.220 |
Abraham Lincoln said, "Nearly all men can stand adversity, 01:26:17.760 |
"but if you want to test a man's character, give him power." 01:26:33.440 |
direct possession and control of the AGI system. 01:26:36.300 |
So what do you think, after spending that evening 01:26:49.180 |
is one where humanity are like the board members 01:27:16.100 |
for what the AGI that represents them should do, 01:27:19.100 |
and then AGI that represents them goes and does it. 01:27:21.420 |
I think a picture like that, I find very appealing. 01:27:27.220 |
you would have an AGI for a city, for a country, 01:27:34.020 |
take the democratic process to the next level. 01:27:49.100 |
as long as it's possible to press the reset button. 01:27:56.400 |
- So I think that it definitely will be possible to build. 01:28:06.300 |
humans have control over the AI systems that they build. 01:28:24.060 |
so it's not that just they can't help but be controlled, 01:28:27.060 |
they exist, one of the objectives of their existence 01:28:48.740 |
and to feed them and to dress them and to take care of them. 01:29:02.300 |
a similar deep drive, that it will be delighted to fulfill 01:29:07.060 |
and the drive will be to help humans flourish. 01:29:17.500 |
And between that moment and the Democratic board members 01:29:31.860 |
So as George Washington, despite all the bad things he did, 01:29:36.500 |
one of the big things he did is he relinquished power. 01:29:39.380 |
He, first of all, didn't want to be president. 01:29:48.080 |
Do you see yourself being able to relinquish control 01:29:59.300 |
At first financial, just make a lot of money, right? 01:30:02.780 |
And then control by having possession of this AGI system. 01:30:09.060 |
I'd find it trivial to relinquish this kind of power. 01:30:11.500 |
I mean, the kind of scenario you are describing 01:30:19.000 |
I would absolutely not want to be in that position. 01:30:25.680 |
or the minority of people in the AI community? 01:30:30.740 |
- It's an open question and an important one. 01:30:33.740 |
Are most people good is another way to ask it. 01:30:49.300 |
Are there specific mechanism you can think of 01:30:56.720 |
of continued alignment as we develop the AI systems? 01:31:01.420 |
In some sense, the kind of question which you are asking is, 01:31:07.380 |
so if I were to translate the question to today's terms, 01:31:10.700 |
it would be a question about how to get an RL agent 01:31:15.700 |
that's optimizing a value function which itself is learned. 01:31:21.220 |
And if you look at humans, humans are like that 01:31:23.220 |
because the reward function, the value function of humans 01:31:39.140 |
and as objective as possible perception system 01:31:41.580 |
that will be trained separately to recognize, 01:31:47.580 |
to internalize human judgments on different situations. 01:31:52.020 |
And then that component would then be integrated 01:32:05.740 |
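A hedged sketch of that idea: train a separate model on human judgments of situations, then expose it as the reward an RL agent optimizes. All data, sizes, and the "approval" rule below are made up for illustration.

```python
import torch

# A small "judgment" model trained on human good/bad labels of situations.
judge = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(judge.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

situations = torch.randn(256, 16)                        # stand-in situation features
labels = (situations[:, 0] > 0).float().unsqueeze(1)     # pretend-human approvals

for step in range(200):                                  # supervised phase
    loss = bce(judge(situations), labels)
    opt.zero_grad(); loss.backward(); opt.step()

def reward(situation):
    # the learned judgment becomes the reward signal the agent sees
    return torch.sigmoid(judge(situation)).item()

print(reward(torch.randn(16)))
```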
- So on that topic of the objective functions 01:32:45.700 |
and try to maximize our own value and enjoyment 01:33:03.980 |
And that's an interesting fact of an RL environment. 01:33:08.100 |
- Well, I was making a slightly different point 01:33:13.340 |
and their wants create the drives that cause them to, 01:33:29.020 |
There's gotta be some underlying sort of Freud, 01:33:34.000 |
there's people who think it's the fear of death 01:33:47.140 |
there might be some kind of fundamental objective function 01:33:54.140 |
but it seems like it's very difficult to make it explicit. 01:33:56.860 |
- I think that probably is an evolutionary objective 01:34:04.300 |
but it doesn't give an answer to the question 01:34:08.220 |
I think you can see how humans are part of this big process, 01:34:13.300 |
this ancient process, we exist on a small planet 01:34:19.940 |
So given that we exist, try to make the most of it 01:34:24.260 |
and try to enjoy more and suffer less as much as we can. 01:34:34.820 |
moments that if you went back, you would do differently? 01:34:39.020 |
And two, are there moments that you're especially proud of 01:34:52.460 |
that with the benefit of hindsight, I wouldn't have made them 01:35:04.700 |
I'm very fortunate to have done things I'm proud of 01:35:10.940 |
but I don't think that that is the source of happiness. 01:35:13.700 |
- So your academic accomplishments, all the papers, 01:35:17.420 |
you're one of the most cited people in the world, 01:35:23.880 |
what is the source of happiness and pride for you? 01:35:29.620 |
- I mean, all those things are a source of pride for sure. 01:35:31.460 |
I'm very grateful for having done all those things 01:35:39.300 |
happiness, well, my current view is that happiness 01:35:51.380 |
or you can talk to someone and be happy as a result as well. 01:35:54.900 |
Or conversely, you can have a meal and be disappointed 01:36:00.460 |
So I think a lot of happiness comes from that, 01:36:02.380 |
but I'm not sure, I don't wanna be too confident. 01:36:05.580 |
- Being humble in the face of the uncertainty 01:36:07.860 |
seems to be also a part of this whole happiness thing. 01:36:12.180 |
Well, I don't think there's a better way to end it 01:36:14.100 |
than meaning of life and discussions of happiness. 01:36:22.620 |
You've given the world many incredible ideas. 01:36:24.900 |
I really appreciate it and thanks for talking today. 01:36:27.500 |
- Yeah, thanks for stopping by, I really enjoyed it. 01:36:33.340 |
and thank you to our presenting sponsor, Cash App. 01:36:38.140 |
by downloading Cash App and using the code LEXPODCAST. 01:36:42.620 |
If you enjoy this podcast, subscribe on YouTube, 01:36:47.980 |
support on Patreon, or simply connect with me on Twitter 01:37:15.220 |
Thank you for listening and hope to see you next time.