Exponential Progress of AI: Moore's Law, Bitter Lesson, and the Future of Computation
Chapters
0:00 Overview
0:37 Bitter Lesson by Rich Sutton
6:55 Contentions and opposing views
9:10 Is evolution a part of search, learning, or something else?
10:51 Bitter Lesson argument summary
11:42 Moore's Law
13:37 Global compute capacity
15:43 Massively parallel computation
16:41 GPUs and ASICs
17:17 Quantum computing and neuromorphic computing
19:25 Neuralink and brain-computer interfaces
21:28 Deep learning efficiency
22:57 Open questions for exponential improvement of AI
28:22 Conclusion
00:00:00.000 |
This video is looking at exponential progress 00:00:02.280 |
for artificial intelligence from a historical perspective 00:00:05.120 |
and anticipating possible future trajectories 00:00:07.640 |
that may or may not lead to exponential progress of AI. 00:00:11.640 |
At the center of this discussion is a blog post, 00:00:17.160 |
"The Bitter Lesson" by Rich Sutton, which ties together several different concepts, 00:00:20.000 |
specifically looking at the role of computation in the progress of AI. 00:00:29.040 |
This discussion is part of the AI Paper Club on our Discord server. 00:00:32.540 |
If you wanna join the discussion, everyone is welcome. 00:00:37.260 |
So I'd like to first discuss the argument made 00:00:50.000 |
in the blog post, and then look at what it is 00:00:58.980 |
that will carry the flag of exponential improvement in AI,
whether that's Moore's law or a bunch of other ideas in both hardware and software. 00:01:07.380 |
The central idea is that most of the improvement 00:01:10.080 |
in artificial intelligence over the past 70 years 00:01:12.860 |
has occurred due to the improvement of computation, 00:01:24.100 |
and so it wasn't the innovation in the algorithms; 00:01:34.020 |
rather, the general methods that leverage computation
were the ones associated with successful progress of AI, 00:01:47.900 |
not the human-expertise-injected methods that leverage small compute. 00:01:54.020 |
When I say small compute, I'm referring to any computational resources available today, 00:02:00.620 |
because the exponential growth of computational resources over the past many decades 00:02:05.980 |
means that anything you have today is much smaller than anything you'll have tomorrow. 00:02:13.180 |
So AI that leverages computation, without the injection of human expertise, human knowledge injection, 00:02:18.840 |
is better in the long run than AI that encodes human expertise. 00:02:32.340 |
And the two categories of methods that leverage computation well
are learning techniques and search techniques. 00:02:41.140 |
A famous example on the search side is IBM Deep Blue beating Garry Kasparov in the game of chess. 00:02:44.900 |
These are the brute force search techniques, 00:03:02.180 |
and on the learning side, there are the brute force self-play mechanisms.
Now, the reason I call the self-play mechanism brute force 00:03:05.340 |
is because the reinforcement learning methods of today 00:03:08.580 |
are fundamentally wasteful in terms of their sample efficiency. 00:03:13.140 |
And that's the critical thing about brute force methods: 00:03:17.100 |
these methods are able to leverage computation, 00:03:30.060 |
and therefore as computation exponentially grows, so does their performance. 00:03:35.720 |
And the blog post provides a few other examples: 00:03:37.740 |
speech recognition that started with heuristics, 00:03:39.900 |
then went to the statistical methods of HMMs, 00:03:45.180 |
and then to the neural networks of today
in speech recognition and natural language processing. 00:03:51.840 |
In computer vision, the fine-tuned human-expertise feature selection 00:03:59.700 |
gave way, finally, with the big ImageNet moment, to deep learning, 00:04:02.360 |
which showed that neural networks are able to discover 00:04:04.820 |
automatically the hierarchy of features required 00:04:07.600 |
to successfully complete different computer vision tasks. 00:04:10.400 |
I think this is a really thought-provoking blog post 00:04:14.280 |
because it suggests that when we develop methods, 00:04:17.160 |
whether it's in the software or the hardware, 00:04:19.160 |
we should be thinking about long-term progress, 00:04:26.120 |
not just about today, but in five years, 10 years, 20 years from now. 00:04:28.840 |
So when you look at the progress of the field 00:04:31.560 |
from that perspective, there's certain things that stand out. 00:04:35.440 |
And Rich argues that actually the majority of things 00:04:39.600 |
that we work on in the artificial intelligence community 00:04:44.980 |
are too focused on the injection of human expertise, 00:04:51.260 |
because that's the kind of incremental improvement that you can publish on 00:04:54.420 |
and then sort of add publications to your resume. 00:05:09.020 |
I think there is something to this from a human psychology perspective. 00:05:20.780 |
In fact, if you look at the brute force search methods, 00:05:35.660 |
they got attention because of the publicity of the actual matches 00:05:37.680 |
they were involved in, but the scientific community, I would say, 00:05:43.460 |
did not give enough respect to the scientific contribution of these general methods. 00:05:50.380 |
I would love to see that when people publish papers today, 00:06:11.220 |
they also ask how well the proposed method will scale as computation grows.
I think that's a really good question to ask, 00:06:19.700 |
a difficult and provocative question that all graduate students 00:06:22.860 |
and faculty and researchers should be asking themselves, 00:06:32.260 |
because I think we often give a disproportionate 00:06:35.060 |
amount of respect, I think, to algorithmic improvement 00:06:40.060 |
and not enough respect when we look at the big arc 00:06:43.380 |
of progress in artificial intelligence to computation, 00:06:48.060 |
whether that's talking about just the raw transistor count 00:06:52.020 |
or other aspects of improving the computational process. 00:07:02.620 |
Now, let me mention a few contentions and opposing views.
First, the blog post doesn't mention anything about data. 00:07:06.740 |
And in terms of learning, if we look at the kind of learning that's most successful 00:07:11.900 |
for real-world applications, it's supervised learning, 00:07:15.220 |
meaning it's learning that uses human annotation of data. 00:07:26.300 |
So the scalability of these methods is coupled
with the scalability of being able to annotate data. 00:07:44.620 |
However, in general, the growth of data and annotation is not directly linked 00:08:00.900 |
to the exponential improvement in raw computation power as observed by Moore's law. 00:08:04.700 |
But of course, you can also generalize this blog post beyond transistors. 00:08:12.940 |
So computation improvement at any level of abstraction, 00:08:18.860 |
from the raw hardware up to the highest level of abstraction of deep learning, 00:08:25.940 |
as long as a method can latch onto the exponential improvement in these contexts, 00:08:29.100 |
it's able to ride the wave of exponential improvement. 00:08:33.100 |
It's just that the main exponential improvement 00:08:36.020 |
we've seen in the past 70 years is that of Moore's law. 00:08:39.620 |
Another contention that I personally don't find 00:08:54.140 |
And the reason I don't find that very convincing 00:09:01.100 |
for deep learning methods is just much smaller 00:09:10.900 |
Now, one big point, I don't know if it's a contention, 00:09:19.820 |
is to think about the existence proof that we have before us, 00:09:24.700 |
the human brain. And I think it's fair to say that the process 00:09:31.380 |
that created it is evolution. Now, the question as it relates to the blog post, to me, 00:09:36.020 |
is whether evolution falls under the category 00:09:46.980 |
of search or learning, or is it a completely different kind of thing 00:09:57.280 |
than what we're referring to in terms of our best performing methods of today, 00:10:17.420 |
the search methods and the learning methods that leverage computation so well? 00:10:37.600 |
To me, the evolutionary process seems to be very brute force. 00:10:41.780 |
So in that way, perhaps it does have similarities to the brute force search 00:10:45.140 |
and the brute force learning self-play mechanisms 00:10:48.660 |
that we see so successfully leveraging computation. 00:10:51.820 |
So to summarize the argument made in "The Bitter Lesson," 00:10:54.840 |
the exponential progress of AI over the past 60, 70 years 00:10:59.140 |
was coupled to the exponential progress of computation 00:11:03.020 |
with Moore's law and the doubling of transistors. 00:11:05.540 |
And as we stand today, the open question then is, 00:11:15.400 |
what will carry that flag of exponential improvement forward: continued growth of computation,
or the invention of new, better, clever algorithms? 00:11:28.180 |
I'll talk about my bets for this open question later in the video. 00:11:49.680 |
So, first, Moore's law. There are a couple of ways people use the term.
One is the precise technical meaning, or the actual meaning, 00:11:53.520 |
which is the doubling of transistor count every two years. 00:11:57.000 |
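As a quick back-of-the-envelope illustration of what that technical definition implies (my own arithmetic, not something worked out in the video), doubling every two years compounds dramatically:

```python
# Transistor count under Moore's law (doubling every two years),
# starting from a rough historical baseline.
initial_count = 2_300          # Intel 4004 (1971), roughly
years = 50
doublings = years / 2
final_count = initial_count * 2 ** doublings
print(f"After {years} years: ~{final_count:,.0f} transistors")
# ~2,300 * 2^25 ≈ 77 billion, the same order of magnitude as the largest chips today
```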
Or you can look at it from a financial perspective, 00:12:04.920 |
putting the cost of computation for different kinds of processes together on the same plot. 00:12:13.000 |
And then there is the more general sense that there's an exponential improvement of computing capability. 00:12:18.240 |
And I'm actually personally okay with that looser use of Moore's law, 00:12:21.480 |
as we generalize across different technologies. 00:12:35.140 |
Now, the question that's been asked over the past several decades is: is Moore's law dead? 00:12:43.840 |
A lot of people say yes. And then there's a few folks, like Jim Keller, now of Intel 00:12:47.280 |
(I did a podcast with him, I highly recommend it), 00:12:50.720 |
who say no, because when we look at the size of today's transistors, 00:12:57.760 |
we're still far from the limit of how small we can get with the transistors. 00:13:00.880 |
Now it gets extremely difficult for many reasons 00:13:04.620 |
to get a transistor that starts approaching the size of individual atoms, 00:13:12.840 |
in terms of what's required for actual fabrication. 00:13:18.040 |
But the theoretical physics limit hasn't been reached yet. 00:13:22.340 |
But also, if we look at the broader definition of Moore's law, 00:13:29.820 |
there's a lot of other candidates, flag bearers, that could carry the exponential improvement forward. 00:13:48.960 |
One of those is global compute capacity, the question being how much total compute capacity 00:13:55.240 |
there is in the world, and looking historically, how has it been increasing? 00:14:02.800 |
There's been some analysis of this, but I do wish there was a little bit more data. 00:14:04.960 |
I'm actually really excited by the potential of this, the growth 00:14:16.900 |
of general compute-capable devices in the world. 00:14:22.920 |
One category of devices that appeared over the past 20 years is gaming consoles, 00:14:38.200 |
but also the compute surface across all types of devices has been expanding, 00:14:48.860 |
which means that that computation can then be leveraged. 00:14:52.920 |
And then we can look at an entirely other dimension 00:15:00.920 |
of virtual reality and augmented reality devices. 00:15:19.800 |
So I'm actually really excited by the potential here 00:15:24.600 |
in terms of the exponential growth of actual devices, 00:15:29.880 |
the exponential expansion of compute surfaces in our world, 00:15:36.280 |
which might force us to rethink the nature of computation. 00:15:46.240 |
Massively parallel computation is another possibility for exponential growth of AI. 00:16:03.040 |
There are all kinds of challenges that characterize parallel computing, 00:16:06.440 |
namely that as you increase the number of processors, 00:16:23.580 |
the serial parts of the computation and the communication costs limit the gains.
But if we can design AI methods that are perfectly parallelizable across thousands, 00:16:27.500 |
maybe millions, maybe billions of processors, 00:16:33.420 |
then that kind of parallelism allows us to exponentially improve the AI algorithms. 00:16:47.420 |
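To see why being perfectly parallelizable matters so much, here is the standard textbook relation, Amdahl's law, as a small sketch (my own illustration of the serial-fraction effect, not something derived in the video): even a tiny serial fraction caps the speedup, no matter how many processors you add.

```python
# Amdahl's law: speedup on n processors when a fraction p of the work
# is parallelizable and (1 - p) must run serially.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.90, 0.99, 1.00):
    print(f"p = {p:.2f}:",
          [round(amdahl_speedup(p, n), 1) for n in (10, 1_000, 1_000_000)])
# p = 0.90: [5.3, 9.9, 10.0]          -> capped at ~10x
# p = 0.99: [9.2, 91.0, 100.0]        -> capped at ~100x
# p = 1.00: [10.0, 1000.0, 1000000.0] -> scales with the number of processors
```

Only in the p = 1.0 case, the perfectly parallelizable one, does adding processors keep paying off indefinitely.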
What we see already with GPUs is devices that are at their core parallelizable, 00:16:56.300 |
or ones that are actually specific to neural networks, 00:17:00.980 |
which is ASICs, application-specific integrated circuits, 00:17:04.620 |
the TPU by Google being an excellent example of that, 00:17:07.140 |
where there's a bunch of hardware design decisions made to specialize the chip for neural network workloads. 00:17:25.640 |
Another possibility is changing the actual nature of computation, 00:17:30.020 |
so a completely different kind of computation. 00:17:38.900 |
You're probably familiar with quantum computing, and there's also neuromorphic computing. 00:17:50.020 |
So there is a lot of excitement and development 00:17:52.740 |
in this space, but it's, I would say, very early days. 00:18:02.140 |
First, it's really hard to build large quantum computers. 00:18:15.100 |
And it's not yet clear how to design algorithms that would benefit from quantum computation, especially in the space of artificial intelligence. 00:18:31.540 |
Now, the idea, I think, with neuromorphic computing is to build hardware that mimics how biological neurons compute. 00:18:38.300 |
One of the characteristic things about the human brain 00:18:42.700 |
is it's much more energy efficient than our computers. 00:18:48.720 |
So neuromorphic computing is trying to achieve that kind of efficiency. 00:18:55.340 |
But again it's early days, and it's unclear how you can design general algorithms 00:18:59.300 |
that reach even close to the same performance as the deep learning methods 00:19:04.940 |
run on classical computers of today with GPUs or ASICs. 00:19:08.620 |
But of course, if you want to have a complete shift in the way 00:19:15.240 |
we approach computation and artificial intelligence, 00:19:17.880 |
a computer which functions in a completely different way 00:19:24.540 |
might be able to achieve that kind of phase shift. 00:19:27.180 |
Now, another really exciting space of methodologies is using the human brain 00:19:44.780 |
for computation. Now, that's a weird way to put it, 00:19:48.140 |
but we have a lot of compute power in our brains. 00:19:50.900 |
We're actually doing a lot of computation, each one of us, 00:20:00.340 |
but we don't have an efficient way to share the outcome of that computation with the world. 00:20:04.940 |
We share it with a very low bandwidth channel. 00:20:13.020 |
So it's interesting to consider: if we can create 00:20:15.420 |
a high bandwidth connection between a computer 00:20:17.220 |
and a human brain, then we're able to leverage 00:20:21.100 |
the computation the human brain already provides 00:20:24.000 |
to be able to add to the global compute capacity. 00:20:38.380 |
Usually, when we think about brain-computer interfaces, the way, for example, 00:20:41.100 |
Elon Musk talks about Neuralink, it's often talked about 00:20:44.420 |
from an individual perspective of increasing your ability 00:20:48.220 |
to communicate with the world and receive information 00:20:50.260 |
from the world, but if you look from a society perspective, 00:20:53.460 |
you're now able to leverage the computational power 00:21:01.620 |
of the human brains that we all already use
to survive in our daily lives, able to leverage that 00:21:14.780 |
collective computing machine, so if you can connect into that 00:21:18.340 |
and share that computation, I think incredible progress in AI could be made, 00:21:26.420 |
perhaps even without significant innovation on the algorithm side. 00:21:29.380 |
Now, a lot of the previous things we talked about 00:21:33.860 |
were on the hardware side or the very low-level software side of exponential improvement. 00:21:37.300 |
I really like the recent paper from Danny Hernandez 00:21:41.180 |
and others at OpenAI called "Measuring the Algorithmic 00:21:43.820 |
Efficiency of Neural Networks" that looks at different 00:21:46.300 |
kinds of domains of machine learning and deep learning 00:21:48.800 |
and shows that the efficiency of the algorithms involved 00:21:51.620 |
has increased exponentially, actually far outpacing Moore's law. 00:21:58.540 |
Starting from the ImageNet moment with AlexNet, 00:22:03.140 |
if we look at AlexNet in 2012 and then EfficientNet in 2019, 00:22:11.140 |
the improvement is that it takes 44 times less computation 00:22:16.140 |
to train a neural network to the level of AlexNet accuracy. 00:22:19.020 |
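To put the 44x figure in perspective, here is a quick calculation of the implied doubling time (my own arithmetic on the numbers above, not taken verbatim from the paper):

```python
import math

# AlexNet (2012) -> EfficientNet (2019): 44x less training compute
# needed to reach AlexNet-level accuracy.
factor, years = 44, 7
doublings = math.log2(factor)                       # ~5.5 doublings
print(f"Efficiency doubling time: {12 * years / doublings:.1f} months")
# -> ~15.4 months between doublings of algorithmic efficiency

# Moore's law alone over the same window:
print(f"Moore's law gain over {years} years: ~{2 ** (years / 2):.0f}x")
# -> ~11x, so the algorithmic gains outpaced the hardware gains
```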
So if we look at Moore's law over the same kind of time period, 00:22:29.540 |
it would only give roughly an 11x improvement.
And the paper highlights also the same kind of 00:22:32.180 |
exponential improvements in natural language processing and other domains. 00:22:36.220 |
So the open question this raises is maybe, with deep learning, 00:22:42.060 |
the algorithmic progress may yield more gains than the hardware improvement. 00:22:51.500 |
That would be exciting, because that means human ingenuity will be essential 00:22:55.040 |
for the continued exponential improvement of AI. 00:22:58.700 |
All that said, whether AI will continue to improve exponentially, 00:23:06.460 |
I don't know, I change my mind every day on most things, 00:23:08.540 |
but today I feel AI will continue to improve exponentially. 00:23:15.820 |
It's not a single sort of nice exponential improvement; 00:23:19.140 |
it's always kind of big breakthrough innovations, 00:23:30.180 |
a stack of S-curves feeding the overall exponential. So the question is,
where will the S-curves that feed the exponential come from? 00:23:45.860 |
One source is the efficiency of learning and search processes. 00:23:50.740 |
You know, there's a lot of terminology swimming around here, 00:24:01.620 |
but you can think of self-supervised learning as one example. 00:24:09.980 |
So that's looking at mechanisms that are extremely powerful 00:24:17.220 |
at learning without human annotation; the idea would be that you would have an algorithm 00:24:28.100 |
that learns in an unsupervised way, just by observing the world. 00:24:31.580 |
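As a toy illustration of what "the labels come from the data itself" means (my own minimal example, not anything from the video), here the task is simply to predict the next value of a signal from its recent past, so no human annotation is involved:

```python
import numpy as np

# Self-supervised toy example: predict the next value of a noisy signal
# from a window of previous values. The targets are carved out of the
# raw data itself, so there is no human labeling step.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)

window = 10
X = np.stack([signal[i:i + window] for i in range(len(signal) - window)])
y = signal[window:]                          # "labels" = the data itself, shifted

w, *_ = np.linalg.lstsq(X, y, rcond=None)    # fit a linear next-step predictor
mse = np.mean((X @ w - y) ** 2)
print(f"next-step prediction MSE: {mse:.4f}")  # low error -> structure was learned
```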
Now, for me, I'm excited by active learning much more, 00:24:34.740 |
which is the optimization of the way you select the data 00:24:42.700 |
you learn from, with the aim of learning from smaller and smaller data sets. 00:24:53.260 |
I think a really simple but exciting example 00:24:55.900 |
of that in the real world is what the Tesla Autopilot team is doing, 00:25:00.660 |
where there's a multitask learning framework, 00:25:06.140 |
and there's a pipeline for discovering edge cases for each of the tasks. 00:25:10.060 |
And you keep feeding the edge cases discovered 00:25:13.700 |
back into the training process, 00:25:18.620 |
and there's a shared part of the network that benefits across the tasks. 00:25:22.180 |
And so there's this active learning framework where the system keeps improving 00:25:27.660 |
as it continually discovers and learns from the edge cases. 00:25:37.500 |
So that's innovation in learning, in terms of its ability 00:25:40.060 |
to discover just the right data to improve its performance. 00:25:44.900 |
And I believe the performance of active learning 00:25:46.860 |
can increase exponentially in the coming years. 00:25:49.620 |
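Here is a minimal sketch of the uncertainty-sampling flavor of active learning (a generic toy setup of my own with made-up data, not Tesla's actual pipeline): instead of labeling examples at random, the model repeatedly asks for labels on the examples it is least sure about, the "edge cases."

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy active learning loop via uncertainty sampling.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)               # hidden ground truth

labeled = list(rng.choice(len(X), size=20, replace=False))  # small random seed set
for round_ in range(10):
    model = LogisticRegression().fit(X[labeled], y[labeled])
    proba = model.predict_proba(X)[:, 1]
    uncertainty = -np.abs(proba - 0.5)           # largest near the decision boundary
    candidates = np.argsort(uncertainty)[::-1]   # most uncertain first
    new = [i for i in candidates if i not in labeled][:20]
    labeled.extend(new)                          # "label" the hardest cases next
    print(f"round {round_}: {len(labeled)} labels, "
          f"accuracy = {model.score(X, y):.3f}")
```

The point of the loop is that each round of labeling effort goes to the examples the current model finds hardest, which is the same spirit as continually feeding discovered edge cases back into training.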
Another source of S-curves that I'm really excited about 00:25:55.500 |
is the general expansion of the compute surfaces in the world. 00:26:12.140 |
And we're not talking about Alexa here or there. 00:26:14.480 |
We're talking about just everything as a compute surface. 00:26:19.340 |
I think that's a really exciting possibility of the future. 00:26:23.300 |
but I certainly hope to be part of the people who help make it happen. 00:26:32.100 |
To me, the total game changer that we don't expect, 00:26:38.820 |
as I mentioned in the context of Neuralink, is brain-computer interfaces. 00:26:44.540 |
In the near term, the focus is on helping understand and treat neurological diseases, 00:26:56.780 |
but in the long term, I think it's the thing that's going to change the nature of computation 00:26:59.480 |
and the nature of artificial intelligence completely. 00:27:02.380 |
If an AI system can communicate with the human brain, 00:27:05.180 |
each leveraging the other's computation, I think that changes everything, 00:27:16.760 |
but at this time, it's shrouded in uncertainty. 00:27:24.920 |
If anyone can make it happen, it's the folks working on brain-computer interfaces 00:27:28.840 |
and the brilliant engineers working at Neuralink. 00:27:32.920 |
When you talk about exponential improvement in AI, 00:27:39.480 |
people often ask when the singularity will arrive:
"Is it 2030, 2045, 2050, a century from now?" 00:27:49.920 |
I think we're living through the singularity. 00:27:52.520 |
I think the exponential improvement 00:27:55.960 |
that we've been a part of in artificial intelligence 00:27:59.600 |
is sufficiently smooth to where we don't even sense 00:28:02.400 |
the madness of the curvature of the improvement 00:28:08.860 |
and every new stage we just so quickly take for granted. 00:28:13.400 |
I think we're living through the singularity, 00:28:15.120 |
and I think we'll continue adapting to it incredibly well. 00:28:24.320 |
So that was my simple attempt to discuss some of the ideas 00:28:27.240 |
by Rich Sutton in his blog post, "The Bitter Lesson," 00:28:30.720 |
and the broader context of exponential improvement in AI 00:28:33.720 |
and the role of computation and algorithmic improvement in that progress. 00:28:45.320 |
Again, this discussion is part of the AI Paper Club on our Discord server.
It's people from all walks of life, all levels of expertise, 00:28:49.240 |
from artists to musicians to neuroscientists to physicists. 00:28:56.800 |
It's, I think, something special, so join us anytime. 00:29:01.440 |
If you have suggestions for papers we should cover, let me know. 00:29:04.680 |
Otherwise, thanks for watching, and I'll see you next time.