Ilya Sutskever (OpenAI) and Jensen Huang (NVIDIA CEO): AI Today and Vision of the Future (3/2023)
my mental memory of the time that I've known you 00:00:26.060 |
the co-invention of AlexNet with Alex Krizhevsky and Geoff Hinton 00:00:32.760 |
that led to the big bang of modern artificial intelligence, 00:00:38.360 |
your career that took you out here to the Bay Area, 00:00:52.320 |
This is the incredible resume of a young computer scientist, 00:01:02.460 |
I guess I just want to go back to the beginning 00:01:08.920 |
what was your intuition around deep learning? 00:01:14.920 |
that it was going to lead to this kind of success? 00:01:19.840 |
thank you so much for the quote, for all the kind words. 00:01:26.180 |
thanks to the incredible power of deep learning. 00:01:51.940 |
and it felt like progress in artificial intelligence 00:01:58.740 |
well, back then, I was starting in 2002, 2003, 00:02:17.520 |
and it wasn't even clear that it was possible in theory. 00:02:21.520 |
And so I thought that making progress in learning, 00:02:30.060 |
that would lead to the greatest progress in AI. 00:02:33.400 |
And then I started to look around for what was out there, 00:02:42.040 |
Geoff Hinton was a professor at my university, 00:02:55.900 |
we are automatically programming parallel computers. 00:03:00.040 |
Back then, the parallel computers were small, 00:03:02.700 |
but the promise was, if you could somehow figure out 00:03:07.900 |
then you can program small parallel computers from data. 00:03:14.520 |
because it's like you had these several factors going for it. 00:03:24.800 |
that seemed like it had by far the greatest long-term promise. 00:03:37.300 |
What was the scale of computing at that moment in time? 00:03:42.400 |
was that the importance of scale wasn't realized back then. 00:03:47.440 |
neural networks with like 50 neurons, 100 neurons, 00:03:55.140 |
A million parameters would be considered very large. 00:03:58.360 |
We would run our models on unoptimized CPU code, 00:04:11.980 |
you know, what is even the right question to ask, you know? 00:04:26.640 |
about training neural nets on small little digits, 00:04:33.640 |
and also he was very interested in generating them. 00:04:36.440 |
So the beginnings of generative models were right there. 00:04:47.640 |
so it wasn't obvious that this was the right question 00:04:52.400 |
but in hindsight, that turned out to be the right question. 00:05:17.800 |
that ImageNet was the right set of data to go for, 00:05:20.860 |
and to somehow go for the computer vision contest? 00:05:30.100 |
It's, I think, probably two years before that, 00:05:36.440 |
it became clear to me that supervised learning 00:05:47.680 |
It was, I would argue, an irrefutable argument, 00:05:57.300 |
then it could be configured to solve a hard task. 00:06:06.860 |
People weren't looking at large neural networks. 00:06:08.860 |
People were maybe studying a little bit of depth 00:06:14.180 |
wasn't even looking at neural networks at all. 00:06:16.060 |
They were looking at all kinds of Bayesian models 00:06:24.640 |
they actually can't represent a good solution 00:06:31.440 |
can represent a good solution to the problem. 00:06:40.400 |
and a lot of compute to actually do the work. 00:06:45.740 |
so we've worked on optimization for a little bit. 00:06:49.320 |
It was clear that optimization is a bottleneck, 00:06:53.140 |
and there was a breakthrough by another grad student 00:07:00.440 |
which is different from the one we're using now. 00:07:09.600 |
because before we didn't even know we could train them. 00:07:21.900 |
back then it seemed like this unbelievably difficult data set. 00:07:28.740 |
a large convolutional neural network on this data set, 00:07:30.980 |
it must succeed if you just can have the compute. 00:07:42.740 |
and somehow you had the observation that a GPU, 00:07:42.740 |
this is our couple of generations into a CUDA GPU, 00:07:56.560 |
You had the insight that the GPU could actually be useful 00:08:05.140 |
Tell me, you and I, you never told me about that moment. 00:08:21.300 |
And we started trying and experimenting with them. 00:08:27.300 |
but it was unclear what to use them for exactly. 00:08:31.640 |
Where are you going to get the real traction? 00:08:33.880 |
But then, with the existence of the ImageNet data set, 00:08:46.640 |
so it should be possible to make it go unbelievably fast, 00:08:58.840 |
and, you know, very fortunately, Alex Krizhevsky, 00:09:06.340 |
And he was able to do it, he was able to code, 00:09:09.600 |
to program really fast convolutional kernels. 00:09:20.220 |
on the ImageNet data set, and that led to the result. 00:09:29.300 |
by such a wide margin that it was a clear discontinuity. 00:09:38.100 |
It's not so much, like, when you say break the record, 00:09:43.880 |
I think there's a different way to phrase it. 00:09:46.060 |
It's that that data set was so obviously hard, 00:09:50.980 |
and so obviously outside of reach of anything. 00:09:54.400 |
People are making progress with some classical techniques, 00:09:58.980 |
But this thing was so much better on the data set, 00:10:03.900 |
It's not just that it's just some competition. 00:10:06.600 |
It was a competition which, back in the day-- 00:10:18.720 |
that if you did a good job, that would be amazing. 00:10:43.420 |
you could see, led up to the ChatGPT moment. 00:10:53.180 |
How would you approach intelligence from that moment 00:11:20.220 |
these amazing neural nets who are doing incredible things, 00:11:49.140 |
most of them were working at Google/DeepMind, 00:11:49.140 |
And then there were people picking up the skills, 00:11:55.860 |
but it was very, very scarce, very rare still. 00:12:17.600 |
one which I was especially excited about very early on, 00:12:22.600 |
is the idea of unsupervised learning through compression. 00:12:36.400 |
that unsupervised learning is this easy thing 00:13:04.340 |
And I really believed that really good compression 00:13:08.720 |
of the data will lead to unsupervised learning. 00:13:11.020 |
Now, compression is not language that's commonly used 00:13:16.580 |
to describe what is really being done until recently, 00:13:20.800 |
when suddenly it became apparent to many people 00:13:23.300 |
that those GPTs actually compress the training data. 00:13:26.460 |
You may recall the Ted Chiang New Yorker article 00:13:34.220 |
in which training these autoregressive generative models 00:13:40.260 |
And intuitively, you can see why that should work. 00:13:45.120 |
you must extract all the hidden secrets which exist in it. 00:13:50.720 |
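The prediction-as-compression idea can be sketched with a toy example (the string and models here are invented for illustration): under ideal arithmetic coding, a model that assigns probability p to the next character spends -log2(p) bits on it, so a model that predicts the data better encodes it in fewer bits.

```python
import math
from collections import Counter

def code_length_bits(text, probs):
    # Ideal code length: -log2(p) bits per symbol (arithmetic-coding bound).
    return sum(-math.log2(probs[c]) for c in text)

text = "aaaaaaab"
chars = sorted(set(text))

# Uniform model: knows nothing about the data's statistics.
uniform = {c: 1.0 / len(chars) for c in chars}

# Learned model: empirical character frequencies from the data itself.
counts = Counter(text)
learned = {c: counts[c] / len(text) for c in chars}

bits_uniform = code_length_bits(text, uniform)   # 8 symbols x 1 bit = 8 bits
bits_learned = code_length_bits(text, learned)   # fewer bits: better predictor
```

The gap between the two code lengths is exactly the training objective of an autoregressive model: lower cross-entropy means better compression.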
So that was the first idea that we were really excited about. 00:13:59.920 |
to the sentiment neuron, which I'll mention very briefly. 00:14:12.300 |
but it was very influential, especially in our thinking. 00:14:39.680 |
trained to predict the next token in Amazon reviews, 00:15:16.120 |
That's what we see with these GPT models, right? 00:15:20.940 |
I mean, at this point, it should be so clear to anyone. 00:15:28.780 |
where do I get the data for unsupervised learning? 00:15:35.100 |
If I could just make you predict the next character, 00:15:40.740 |
I could train a neural network model with that. 00:15:45.140 |
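As a toy illustration of the "predict the next character" recipe (not OpenAI's actual setup; the corpus is invented), even a count-based bigram model shows where the training signal comes from: the data itself says which character follows which.

```python
from collections import defaultdict, Counter

def train_bigram(text):
    # Count, for each context character, which character follows it.
    model = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        model[prev][nxt] += 1
    return model

def predict_next(model, prev):
    # Greedy prediction: the most frequent continuation of `prev`.
    return model[prev].most_common(1)[0][0]

corpus = "the cat sat on the mat. the cat ate."
model = train_bigram(corpus)
# In this corpus, 'h' is always followed by 'e' (every "the").
```

A real language model replaces the count table with a neural network and conditions on a long context rather than one character, but the supervision is the same: the next token.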
and masking, and other technology, other approaches, 00:15:52.500 |
that's unsupervised for unsupervised learning? 00:16:08.300 |
though that part is there as well, especially now. 00:16:21.900 |
that training these neural nets to predict the next token 00:17:16.460 |
will improve the performance of these models. 00:17:59.700 |
Did you see the evidence of GPT-1 through 3 first, 00:18:02.220 |
or was it intuition about the scaling law first? 00:18:21.180 |
is to figure out how to use the scale correctly. 00:18:29.260 |
The question is what to use it for precisely. 00:18:36.820 |
but there's another very important line of work 00:18:41.140 |
but I think now is a good time to make a detour, 00:19:20.260 |
And there is a whole competitive league for that game. 00:19:26.540 |
And so we train a reinforcement learning agent 00:20:04.980 |
but they really led up to some of the important work 00:20:20.020 |
morphed into reinforcement learning from human feedback. 00:20:36.020 |
There's a system around it that's fairly complicated. 00:20:54.860 |
and give it knowledge and so on and so forth? 00:21:11.700 |
in lots of different texts from the internet, 00:21:16.500 |
what we are doing is that we are learning a world model. 00:21:25.500 |
that we are just learning statistical correlations in text, 00:21:40.100 |
is some representation of the process that produced the text. 00:21:44.580 |
This text is actually a projection of the world. 00:22:05.860 |
their interactions in the situations that we are in. 00:22:51.700 |
If I had some random piece of text on the internet, 00:23:10.700 |
which will be truthful, that will be helpful, 00:23:14.180 |
that will follow certain rules and not violate them. 00:23:22.420 |
and the reinforcement learning from human teachers 00:23:27.380 |
It's not just reinforcement learning from human teachers. 00:23:37.060 |
But here we are not teaching it new knowledge. 00:23:41.980 |
We are teaching it, we are communicating with it. 00:24:02.300 |
So the second stage is extremely important too, 00:24:04.940 |
in addition to the first stage of the learn everything, 00:24:08.580 |
learn everything, learn as much as you can about the world 00:24:12.860 |
from the projection of the world, which is text. 00:24:16.580 |
- Now you could tell, you could fine-tune it, 00:24:19.300 |
you could instruct it to perform certain things. 00:24:23.380 |
Can you instruct it to not perform certain things 00:24:31.940 |
so that it doesn't wander out of that bounding box 00:24:35.660 |
and perform things that are unsafe or otherwise? 00:24:45.060 |
is indeed where we communicate to the neural network 00:24:48.940 |
anything we want, which includes the bounding box. 00:25:19.660 |
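A heavily simplified sketch of the second stage's mechanics: a REINFORCE-style update that shifts a policy toward the completion a human teacher rated higher. The two-completion setup, reward values, and learning rate are invented for illustration, not OpenAI's actual procedure.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

# Two candidate completions; the human teacher prefers the second one.
logits = [0.0, 0.0]    # the "policy" over completions
rewards = [0.0, 1.0]   # human feedback: undesired vs desired behavior

lr = 0.5
for _ in range(100):
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # Policy gradient: d/dlogit_i E[reward] = p_i * (r_i - E[reward])
    logits = [z + lr * p * (r - expected)
              for z, p, r in zip(logits, probs, rewards)]

probs = softmax(logits)  # now strongly favors the human-preferred completion
```

The point of the toy: no new knowledge enters the model; the update only redistributes probability mass toward behavior the teacher approves of, which is the "communicating what we want" framing above.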
- ChatGPT came out just a few months ago. 00:25:21.980 |
Fastest growing application in the history of humanity. 00:25:41.580 |
that anyone has ever created for anyone to use. 00:25:49.580 |
it does things that are beyond people's expectation. 00:26:00.700 |
And if your instructions or prompts are ambiguous, 00:26:08.620 |
until your intents are understood by the application, 00:26:28.020 |
The performance of GPT-4 in many areas, astounding. 00:26:38.580 |
the number of tests that it's able to perform 00:26:43.660 |
at very capable levels, very capable human levels, astounding. 00:26:48.180 |
What were the major differences between ChatGPT and GPT-4 00:27:01.900 |
is a pretty substantial improvement on top of ChatGPT 00:27:19.780 |
maybe eight months ago, I don't remember exactly. 00:27:23.220 |
The first big difference between ChatGPT and GPT-4, 00:27:30.020 |
And that perhaps is the most important difference, 00:27:39.420 |
predicts the next word with greater accuracy. 00:27:52.380 |
This claim is now perhaps accepted by many at this point, 00:27:59.500 |
or not completely intuitive as to why that is. 00:28:04.220 |
and to give an analogy that will hopefully clarify 00:28:07.500 |
why more accurate prediction of the next word 00:28:10.860 |
leads to more understanding, real understanding. 00:28:28.140 |
Then let's say that at the last page of the book, 00:28:48.660 |
but by predicting those words better and better and better, 00:28:51.500 |
the understanding of the text keeps on increasing. 00:28:58.980 |
- Ilya, people say that deep learning won't lead to reasoning. 00:29:04.420 |
- That deep learning won't lead to reasoning. 00:29:09.380 |
figure out from all of the agents that were there 00:29:13.180 |
and all of their strengths or weaknesses or their intentions 00:29:18.180 |
and the context, and to be able to predict that word, 00:29:30.180 |
And so how is it that it's able to learn reasoning? 00:29:40.180 |
you know, one of the things that I was going to ask you 00:29:57.940 |
was not as good at that GPT-4 was much better at. 00:30:02.820 |
And there were some tests that neither are good at yet. 00:30:07.140 |
and some of it has to do with reasoning, it seems, 00:30:12.100 |
that it wasn't able to break maybe the problem down 00:30:24.980 |
And so is that an area that in predicting the next word, 00:30:37.660 |
that would enhance its ability to reason even further? 00:30:41.180 |
You know, reasoning isn't this super well-defined concept, 00:30:51.780 |
which is when you maybe, maybe when you go further, 00:30:56.580 |
where you're able to somehow think about it a little bit 00:31:00.700 |
and get a better answer because of your reasoning. 00:31:08.900 |
you know, maybe there is some kind of limitation 00:31:16.820 |
This has proven to be extremely effective for reasoning, 00:31:21.860 |
just how far the basic neural network will go. 00:31:24.260 |
I think we have yet to tap, fully tap out its potential. 00:31:29.260 |
But yeah, I mean, there is definitely some sense 00:31:36.420 |
where reasoning is still not quite at that level 00:31:41.420 |
as some of the other capabilities of the neural network, 00:31:45.860 |
though we would like the reasoning capabilities 00:31:51.380 |
I think that it's fairly likely that business as usual 00:31:55.900 |
will keep, will improve the reasoning capabilities 00:31:59.660 |
I wouldn't necessarily confidently rule out this possibility. 00:32:04.660 |
Yeah, because one of the things that is really cool 00:32:14.020 |
tell me first what you know, and then answer the question. 00:32:18.180 |
You know, usually when somebody answers a question, 00:32:20.140 |
if you give me the foundational knowledge that you have 00:32:23.580 |
or the foundational assumptions that you're making 00:32:27.060 |
that really improves my believability of the answer. 00:32:31.940 |
You're also demonstrating some level of reasoning, 00:32:43.580 |
The one way to think about what's happening now 00:32:59.980 |
for these neural networks being useful, truly useful. 00:33:08.060 |
that these neural networks hallucinate a little bit, 00:33:12.700 |
or maybe make some mistakes which are unexpected, 00:33:14.980 |
which you wouldn't expect the person to make, 00:33:23.460 |
But I think that perhaps with a little bit more research, 00:33:28.060 |
and perhaps a few more of the ambitious research plans, 00:33:33.060 |
you'll be able to achieve higher reliability as well. 00:33:37.780 |
that will allow us to have very accurate guardrails, 00:33:53.140 |
when it doesn't know, and do so extremely reliably. 00:33:57.580 |
So I'd say that these are some of the bottlenecks, really. 00:34:09.820 |
You know, speaking of factualness and factfulness, 00:34:19.900 |
a demonstration that links to a Wikipedia page. 00:34:31.500 |
Is it able to retrieve information from a factual place 00:34:44.220 |
does not have a built-in retrieval capability. 00:34:47.060 |
It is just a really, really good next-word predictor, 00:34:56.180 |
Yeah, I was about to ask you about multi-modality. 00:35:14.020 |
it wouldn't surprise me if some of the people 00:35:21.740 |
and then populate the results inside the context, 00:35:45.740 |
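The retrieval pattern alluded to here, fetch relevant text and place it in the model's context before the question, can be sketched with a toy bag-of-words retriever. The documents, scoring, and prompt template are illustrative stand-ins for a real embedding model and a real LLM call.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words counts (stand-in for a learned encoder).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "AlexNet won the ImageNet competition in 2012.",
    "CUDA is a parallel computing platform from NVIDIA.",
    "Transformers are trained to predict the next token.",
]

def build_prompt(question, docs, k=1):
    # Rank documents by similarity to the question; put top-k in the context.
    ranked = sorted(docs, key=lambda d: cosine(embed(question), embed(d)),
                    reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("Which competition did AlexNet win?", docs)
```

The model itself stays a next-word predictor; retrieval just changes what is in its context window.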
GPT-4 has the ability to learn from text and images 00:35:56.460 |
First of all, the foundation of multi-modality learning, 00:36:16.340 |
help us understand how multi-modality enhances 00:36:20.540 |
the understanding of the world beyond text by itself. 00:36:25.540 |
And my understanding is that when you do multi-modality 00:36:35.220 |
learning, that even when it is just a text prompt, 00:36:43.700 |
Tell us about multi-modality at the foundation, 00:36:46.780 |
why it's so important and what's the major breakthrough 00:36:50.420 |
and the characteristic differences as a result. 00:36:54.020 |
So there are two dimensions to multi-modality, 00:37:06.220 |
The first reason is that multi-modality is useful. 00:37:13.180 |
vision in particular, because the world is very visual. 00:37:37.100 |
though still considerable, is not as big as it could be. 00:38:00.020 |
by learning from images in addition to learning from text. 00:38:09.700 |
though it is not as clear cut as it may seem. 00:38:36.020 |
- Does that include my own words in my own head? 00:38:50.060 |
like we don't get to see more than a few words a second, 00:38:59.860 |
to get as many sources of information as we can. 00:39:02.700 |
And we absolutely learn a lot more from vision. 00:39:20.140 |
about the world from text in a few billion words 00:39:33.700 |
Surely, one needs to see to understand colors. 00:39:43.180 |
who've never seen a single photon in their entire life, 00:39:47.660 |
if you ask it which colors are more similar to each other, 00:39:47.660 |
it will know that red is more similar to orange than to blue. 00:39:55.060 |
It will know that blue is more similar to purple 00:40:01.740 |
And one answer is that information about the world, 00:40:28.980 |
there are things which are impossible to learn 00:40:45.420 |
then of course the other sources of information 00:41:19.620 |
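The claim that color relations leak into text can be illustrated with simple distributional statistics (the corpus is contrived for the demo): words used in similar contexts end up with similar co-occurrence vectors, so "red" lands nearer "orange" than "blue" without any visual input.

```python
import math
from collections import Counter, defaultdict

corpus = ("red apple warm red sunset warm orange sunset warm orange apple warm "
          "blue ocean cool blue sky cool purple dusk cool purple sky cool").split()

# Co-occurrence vectors: count neighbors within a +/-2 word window.
window = 2
vecs = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            vecs[w][corpus[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "red" shares contexts (warm, apple, sunset) with "orange" but not "blue".
sim_red_orange = cosine(vecs["red"], vecs["orange"])
sim_red_blue = cosine(vecs["red"], vecs["blue"])
```

Word2vec-style embeddings and language models exploit exactly this signal at vastly larger scale.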
- And if I wanted to augment all of that with sound, 00:42:02.580 |
It's useful, it's an additional source of information. 00:42:13.780 |
both on the recognition side and on the production side. 00:42:29.420 |
Which of the tests were performed well by GPT-3, 00:42:38.580 |
How did multimodality contribute to those tests, 00:42:44.260 |
- Oh, I mean, in a pretty straightforward way, 00:42:48.900 |
anytime there was a test where a problem would, 00:42:55.780 |
Like, for example, in some math competitions. 00:43:05.700 |
And there, presumably, many of the problems have a diagram. 00:43:22.460 |
but it's like maybe from 2% to 20% accuracy of success rate. 00:43:27.460 |
But then when you add vision, it jumps to 40% success rate. 00:43:35.100 |
And I think being able to reason visually as well 00:43:38.980 |
and communicate visually will also be very powerful 00:43:43.860 |
which go beyond just learning about the world. 00:43:50.220 |
You can then reason about the world visually. 00:43:54.820 |
Where now, in the future, perhaps, in some future version, 00:43:57.860 |
if you ask your neural net, "Hey, explain this to me," 00:44:02.300 |
it will produce, "Hey, here's a little diagram 00:44:10.260 |
You know, one of the things that you said earlier 00:44:11.700 |
about an AI generating a test to train another AI, 00:44:16.700 |
you know, there was a paper that was written about, 00:44:21.060 |
and I don't completely know whether it's factual or not, 00:44:29.540 |
to something like 20 trillion useful, you know, 00:44:41.620 |
And that would have run out of tokens to train. 00:45:06.140 |
but we train our brain with generated data all the time 00:45:15.940 |
working through a problem in our brain, you know, 00:45:19.660 |
and, you know, I guess neuroscientists suggest sleeping. 00:45:28.140 |
How do you see this area of synthetic data generation? 00:45:32.900 |
Is that going to be an important part of the future 00:45:38.380 |
- Well, I think, like I wouldn't underestimate 00:45:45.020 |
I think there's probably more data than people realize. 00:45:59.940 |
that one of these days our AIs are, you know, 00:46:06.220 |
maybe generating either adversarial content for itself 00:46:16.940 |
Tell us whatever you can about where we are now 00:46:23.020 |
and what do you think we'll be in not too distant future, 00:46:26.860 |
but, you know, pick your horizon, a year or two. 00:46:31.420 |
What do you think this whole language model area would be 00:46:34.060 |
in some of the areas that you're most excited about? 00:46:38.660 |
And it's a bit, although it's a little difficult 00:46:46.580 |
I think it's safe to assume that progress will continue. 00:46:52.580 |
And that we will keep on seeing systems which astound us 00:47:00.820 |
And the current frontiers will be centered around reliability 00:47:09.020 |
really get into a point where we can trust what it produces, 00:47:26.300 |
the areas where improvement will lead to the biggest impact 00:47:32.860 |
Because right now that's really what stands in the way. 00:47:36.780 |
you ask a neural net to maybe summarize some long document 00:47:40.980 |
Like, are you sure that some important detail 00:47:48.940 |
that all the important points have been covered. 00:48:04.780 |
then the neural network will also recognize that reliably. 00:48:16.860 |
So I think we'll see a lot of that in the next two years. 00:48:22.300 |
will make this technology trusted enough for people to use 00:48:28.940 |
I was thinking that was gonna be the last question, 00:48:30.660 |
but I did have another one, sorry about that. 00:48:40.420 |
what are some of the skills that it demonstrated 00:48:47.180 |
- Well, there were lots of really cool things 00:49:05.220 |
I'm just trying to think about the best way to go about it. 00:49:09.620 |
The short answer is that the level of its reliability 00:49:28.540 |
Its ability to solve math problems became far greater. 00:49:31.940 |
It was like you could really do the derivation, 00:49:45.100 |
- Not all proofs, naturally, but quite a few. 00:49:50.060 |
like many people noticed that it has the ability 00:50:02.060 |
- It follows instructions really, really clearly. 00:50:04.780 |
- Not perfectly still, but much better than before. 00:50:14.380 |
You show it a meme and ask it why it's funny, 00:50:16.300 |
and it will tell you, and it will be correct. 00:50:21.540 |
was also very, it's like really actually seeing it 00:50:27.500 |
about some complicated image with a complicated diagram 00:50:33.420 |
But yeah, overall, I will say, to take a step back, 00:50:38.220 |
you know, I've been in this business for quite some time. 00:50:55.220 |
- Like it turned out to be the same little thing all along, 00:51:00.300 |
and a lot more serious and much more intense, 00:51:03.500 |
but it's the same neural network, just larger, 00:51:06.740 |
trained on maybe larger data sets in different ways 00:51:09.700 |
with the same fundamental training algorithm. 00:51:14.820 |
I would say this is what I find the most surprising. 00:51:19.220 |
- Whenever I take a step back, I go, how is it possible 00:51:21.420 |
that those ideas, those conceptual ideas about, 00:51:26.420 |
so maybe artificial neurons are just as good, 00:51:29.180 |
and so maybe we just need to train them somehow 00:51:32.020 |
that those arguments turned out to be so incredibly correct. 00:51:39.900 |
- In the 10 years that we've known each other, 00:52:01.900 |
would have believed that the amount of computation 00:52:09.980 |
and that you dedicated your career to go do that. 00:52:13.980 |
You've done many more, your body of work is incredible, 00:52:22.620 |
the co-invention with AlexNet and that early work, 00:52:28.820 |
it is truly remarkable what you've accomplished. 00:52:35.420 |
I'm a good friend and it is quite an amazing moment. 00:52:40.420 |
And today's talk, the way you break down the problem 00:52:45.740 |
and describe it, this is one of the best PhD, 00:52:50.740 |
beyond PhD descriptions of the state-of-the-art