
Exponential Progress of AI: Moore's Law, Bitter Lesson, and the Future of Computation


Chapters

0:00 Overview
0:37 Bitter Lesson by Rich Sutton
6:55 Contentions and opposing views
9:10 Is evolution a part of search, learning, or something else?
10:51 Bitter Lesson argument summary
11:42 Moore's Law
13:37 Global compute capacity
15:43 Massively parallel computation
16:41 GPUs and ASICs
17:17 Quantum computing and neuromorphic computing
19:25 Neuralink and brain-computer interfaces
21:28 Deep learning efficiency
22:57 Open questions for exponential improvement of AI
28:22 Conclusion

Transcript

This video is looking at exponential progress for artificial intelligence from a historical perspective and anticipating possible future trajectories that may or may not lead to exponential progress of AI. At the center of this discussion is a blog post called "The Bitter Lesson" by Rich Sutton, which ties together several different concepts, specifically looking at the role of computation in the progress of artificial intelligence and computer science in general.

This blog post and the broader discussion is part of the AI Paper Club on our Discord server. If you wanna join the discussion, everyone is welcome. Link is in the description. So I'd like to first discuss the argument made in "The Bitter Lesson" by Rich Sutton that discusses the role of computation in the progress of artificial intelligence.

And then I'd like to look into the future and see what are the possible ideas that will carry the flag of exponential improvement in AI, whether it is in computation with the continuation of Moore's law, or a bunch of other ideas in both hardware and software. So "The Bitter Lesson," the basic argument contains several ideas.

The central idea is that most of the improvement in artificial intelligence over the past 70 years has occurred due to the improvement of computation versus improvement in algorithms. And when I say improvement of computation, I mean Moore's law, transistor count, doubling every two years. And so it wasn't the innovation in the algorithms, but instead the same brute force algorithms that were sufficiently general and were effective at leveraging computation were the ones associated with successful progress of AI.

So put another way, general methods that are automated and can leverage big compute are better than specialized, fine-tuned, human expertise injected methods that leverage small compute. And when I say small compute, I'm referring to any computational resources available today because with the exponential growth of computational resources over the past many decades with Moore's law, basically anything you have today is much smaller than anything you'll have tomorrow.

That's how exponential growth works. And looking from yet another perspective of human expertise, human knowledge injection, AI that discovers a solution by itself is better than AI that encodes human expertise and human knowledge. Rich Sutton also in his blog post argues that the two categories of techniques that were most capable of leveraging massive amounts of computation are learning techniques and search techniques.

Now, by way of example, you can think of search techniques as the ones that were used to beat Garry Kasparov, IBM's Deep Blue in the game of chess. These are the brute force search techniques that were criticized at the time for their brute force nature. And the same, I would say, applies to the brute force learning techniques of Google DeepMind that beat the world champion at the game of Go.

Now, the reason I call the self-play mechanism brute force is because the reinforcement learning methods of today are fundamentally wasteful in terms of how efficiently they learn. And that's the critical thing Rich Sutton argues about brute force methods: they are able to leverage computation.

So they may not be efficient, and they may not have the kind of cleverness that human expertise might provide, but they're able to leverage computation, and therefore, as computation exponentially grows, they're able to outperform everything else. And the blog post provides a few other examples: speech recognition, which started with heuristics and then moved to the statistical methods of HMMs.

And finally, the recent big successes in speech recognition and natural language processing in general have come with neural networks. The same holds in the computer vision world: the fine-tuned, human-expertise-driven feature engineering of everything that led up to SIFT, and then finally the big ImageNet moment, which showed that neural networks are able to automatically discover the hierarchy of features required to successfully complete different computer vision tasks.

I think this is a really thought-provoking blog post because it suggests that when we develop methods, whether it's in the software or the hardware, we should be thinking about long-term progress, the impact of our ideas, not for this year, but in five years, 10 years, 20 years from now.

So when you look at the progress of the field from that perspective, there are certain things that are not gonna hold up. And Rich argues that actually the majority of things we work on in the artificial intelligence community, especially in academic circles, is too focused on the injection of human expertise, because that is how you get incremental improvements you can publish on, add publications to your resume, achieve career success and progress, and feel better, because you're injecting your own expertise into the system as opposed to relying on these, quote-unquote, dumb brute force approaches.

I think there is something, from a human psychology perspective, about brute force methods just not being associated with innovative, brilliant thinking. In fact, if you look at the brute force search or the brute force learning approaches, the publications and the science associated with these methods did not, at the time, get the recognition they deserve.

They got a huge amount of recognition because of the publicity of the actual matches they were involved in, but I don't think the scientific community gave enough respect to the scientific contribution of these general methods. As an interesting, thought-provoking idea, I would love to see papers published today include almost a dedicated section where the authors ask: if computation were scaled by 10x or 100x, looking five or ten years down the future, would this method hold up to that scaling?

Is it scalable? Is this method fundamentally scalable? I think that's a really good question to ask. Is this something that would benefit from, or at least scale linearly with, more compute? That, to me, is a really interesting and provocative question that all graduate students, faculty, and researchers should be asking themselves about the methods they propose.

Overall, I think this blog post serves as a really good thought experiment, because when we look at the big arc of progress in artificial intelligence, we often give a disproportionate amount of respect to algorithmic improvement and not enough to computation, to the improvement of computation, whether that's just the raw transistor count or other aspects of improving the computational process.

If we look at this blog post as it is, you can, of course, raise some contentions and some opposing views. First, the blog post doesn't mention anything about data. And in terms of learning, if we look at the kind of learning that's been really successful for real-world applications, it's supervised learning, meaning it's learning that uses human annotation of data.

And so the scalability of learning methods with computation also needs to be coupled with the scalability of being able to annotate data. And it's unclear to me how scalability in computation naturally translates into scalability in the annotation of data. Now, I'll propose some ideas there later on, in the space of active learning, that I think are super exciting, but in general, those two are not directly linked, at least in the argument of the blog post.

So to be fair, the blog post is looking at the historical context of progress in AI. And so in that way, it's looking at methods that leverage the exponential improvement in raw computation power as observed by Moore's law. But of course, you can also generalize this blog post to say really any methods that hook onto any kind of exponential improvement.

So computation improvement at any level of abstraction, including, as we'll later talk about, at the highest level of abstraction of deep learning or even meta-learning. As long as these methods can hook onto the exponential improvement in those contexts, they're able to ride the wave of exponential improvement. It's just that the main exponential improvement we've seen in the past 70 years is that of Moore's law.

Another contention, one I personally don't find very convincing, is that learning and search methods do in fact require human expertise: you still need to do some fine-tuning, and there are still a bunch of tricks, even though they're at a higher level. The reason I don't find that very convincing is because I think the amount and the quality of human expertise required for deep learning methods is just much smaller and much more directed than in classical machine learning methods, or especially in heuristic-based methods.

Now, one big point, I don't know if it's a contention, but it's an open question for me: when we chase the creation of intelligent systems, it's often useful to think about the existence proof that we have before us, which is our own brain. And I think it's fair to say that the process that created the intelligence of our brain is the evolutionary process.

Now, the question as it relates to the blog post, to me, is whether evolution falls under the category of search methods or of learning methods, or some mix of the two. Is it a subset, a combination of the two, or is it a superset? Or is it a completely different kind of thing?

I think that's a really interesting and really difficult question, one that I think about often. What is the evolutionary process in terms of our best performing methods of today? Of course, there's genetic algorithms, there's genetic programming. These are very specialized, evolution-inspired methods. But the actual evolutionary process that created life on Earth, that created intelligent life on Earth, how does that relate to the search and the learning methods that leverage computation so well?

It does seem from a 10,000 foot level that the evolutionary process, whether it relates to search or learning, is the kind of process that would leverage computation very well. In fact, from a human-centric perspective of a human that values his life, the evolutionary process seems to be very brute force, very wasteful.

So in that way, perhaps it does have similarities to the brute force search and the brute force learning self-play mechanisms that we see so successfully leveraging computation. So to summarize the argument made in "The Bitter Lesson": the exponential progress of AI over the past 60, 70 years was coupled to the exponential progress of computation with Moore's law and the doubling of transistors.

And as we stand today, the open question then is, if we look at the possibility of future exponential improvement of artificial intelligence, will that be due to human ingenuity, that is, the invention of new, better, cleverer algorithms, or will it be due to an increase in raw computation power? Or, I think a distinct option, both.

I'll talk about my bets for this open question at the end of the video, but at this time, let's talk about some possible flag bearers of exponential improvement in AI in the coming years and decades. First, let's look at Moore's law, which is an observation, it's not a law.

It has two meanings, I would say. One is the precise technical meaning, the actual meaning, which is the doubling of transistor count every two years. Or you can look at it from a financial perspective and look at dollars per FLOP, which decreases exponentially. This allows you to compare CPUs, GPUs, and different kinds of processors together on the same plot.

And the second meaning, which I think is very commonly used in general public discourse, is the general sense that there's an exponential improvement of computational capabilities. And I'm actually personally okay with that usage, generalizing across different technologies and different ideas, using Moore's law to mean the general observation of exponential improvement in computational capabilities.
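
To make the first, technical meaning concrete, here is a minimal back-of-the-envelope sketch in Python of what a fixed two-year doubling period implies for both transistor count and dollars per FLOP. The time horizon and doubling period are the usual illustrative assumptions, not measured data.

```python
# A fixed doubling period compounds the same way whichever quantity you track.
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative growth implied by a fixed doubling period."""
    return 2 ** (years / doubling_period_years)

decade = growth_factor(10)
print(round(decade, 1))        # ~32x more transistors after 10 years
print(round(1 / decade, 3))    # dollars per FLOP fall to ~1/32 (~0.031) of today's
```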

So the question that's been asked many times over the past several decades, is Moore's law dead, I think has two camps. The majority of the industry says yes. And then there are a few folks like Jim Keller, now of Intel (I did a podcast with him, I highly recommend it), who say no, because when we look at the size of transistors, we have not yet hit the theoretical physics limit of how small transistors can get.

Now, it gets extremely difficult for many reasons to get a transistor that starts approaching the size of a single nanometer, in terms of power, in terms of error correction, in terms of what's required to actually fabricate at that scale. But the theoretical physics limit hasn't been reached, so Moore's law can continue.

But also if we look at the broader definition of just exponential improvement of computational capabilities, there's a lot of other candidates, flag bearers, as I mentioned, that could carry that exponential flag forward. And let's look at them now. One is the global compute capacity. Now this one is really interesting.

And I actually had trouble finding good data to answer the very basic question. I don't think that data exists. The question being how much total compute capacity is there in the world today? And looking historically, how has it been increasing? There's a few kind of speculative studies. Some of them I cite here.

They're really interesting, but I do wish there was a little bit more data. I'm actually really excited by the potential of this to play out in completely unexpected ways in the future. Now, what are we talking about? We're talking about the actual number of general-compute-capable devices in the world.

One of the really powerful classes of compute devices that appeared over the past 20 years is gaming consoles. The other, over maybe the past 10 years, is smartphones. Now, if we look into the future, there's the possibility, first of all, of smartphone devices growing exponentially, but also of the compute surface expanding across all types of devices.

So if we think of the Internet of Things, every object in our day-to-day life gaining computational capabilities means that that computation can then be leveraged in some distributed way. And then we can look at an entirely different dimension of devices that could explode exponentially in the near- or long-term future: virtual reality and augmented reality devices.

So currently both of those types of devices are not really gaining ground, but it's very possible that in the future, a huge amount of computational resources become available for these virtual worlds, for augmented worlds. So I'm actually really excited by the potential things that we can't yet expect in terms of the exponential growth of actual devices, which are able to do computation.

The exponential expansion of compute surfaces in our world. That's really interesting. That might force us to rethink the nature of computation, to push it more and more towards distributed computation. So speaking of distributed computation, another possibility for exponential growth of AI is just massively parallel computation: increasing CPUs, GPUs, stacking them on top of each other, and increasing that stack exponentially.

Now you run up against Amdahl's law and all the challenges it characterizes: as you increase the number of processors, it becomes more and more difficult; there's a diminishing return in the compute speedup you gain when you add more processors. Now, if we can overcome Amdahl's law, if we can successfully design algorithms that are perfectly parallelizable across thousands, maybe millions, maybe billions of processors, then that changes the game.
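
As a minimal sketch of that diminishing-returns effect, here is Amdahl's law in a few lines of Python; the 95% parallel fraction and the processor counts are illustrative assumptions.

```python
def amdahl_speedup(p, n):
    """Maximum speedup on n processors when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, the speedup saturates near 1/(1-p) = 20x.
for n in (10, 100, 1_000, 1_000_000):
    print(n, round(amdahl_speedup(0.95, n), 1))   # ~6.9, ~16.8, ~19.6, ~20.0

# Only as p approaches 1, i.e. perfectly parallelizable algorithms,
# does adding more processors keep paying off.
print(round(amdahl_speedup(0.9999, 1_000_000), 0))  # ~9901
```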

That changes the game and allows us to exponentially improve AI algorithms by exponentially increasing the number of processors involved. Another dimension of approaches that contribute to exponential growth of AI is devices that are at their core parallelizable: more general devices like GPUs, graphics processing units, or ones that are actually specific to neural networks or whatever the algorithm is, which are ASICs, application-specific integrated circuits.

The TPU by Google is an excellent example of that, where a bunch of hardware design decisions are made that are specialized to machine learning, allowing it to be much more efficient in terms of both energy use and the actual performance of the algorithm. Now, another big space of flag bearers for exponential AI growth, one I could probably divide across many slides, is changing the actual nature of computation.

So a completely different kind of computation. Two exciting candidates are shown here: one is quantum computing and the other is neuromorphic computing. You're probably familiar with quantum computers and qubits: whereas classical computers only represent zeros and ones, qubits can also represent superpositions of zero and one.

So there is a lot of excitement and development in this space, but it's, I would say, very early days, especially considering general methods that are able to leverage computation. First, it's really hard to build large quantum computers, but even if you can, second, it's very hard to build algorithms that significantly outperform the algorithms on classical computers, especially in the space of artificial intelligence with machine learning.
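
For intuition only, here is a tiny sketch, using plain NumPy rather than any quantum-computing library, of what "a superposition of zero and one" means: a qubit's state is a normalized complex 2-vector, and the squared amplitudes give the measurement probabilities. The specific amplitudes are arbitrary assumptions.

```python
import numpy as np

# |psi> = alpha*|0> + beta*|1>, with |alpha|^2 + |beta|^2 = 1
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)   # an equal superposition
state = np.array([alpha, beta])

assert np.isclose(np.sum(np.abs(state) ** 2), 1.0)   # normalization check
p_zero, p_one = np.abs(state) ** 2                    # measurement probabilities
print(p_zero, p_one)                                  # ~0.5 ~0.5
```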

Then there's another space of computing called neuromorphic computing that draws a lot more inspiration from the human brain. Specifically, it models spiking neural networks. Now, the idea, I think, with neuromorphic computing is that it's able to perform computation in a much more efficient way. One of the characteristic things about the human brain versus our computers today is that it's much more energy efficient for the same amount of computation.

So neuromorphic computing is trying to achieve the same kind of performance. Again, very early days, and it's unclear how you can design general algorithms that come even close to the performance of machine learning algorithms run on today's classical computers with GPUs or ASICs. But of course, if you want a complete shift, a phase shift in the way we approach computation and artificial intelligence, a computer which functions in a completely different way than our classical computers is something that might be able to achieve that kind of phase shift.

Now, another really exciting space of methodologies is brain-computer interfaces. In the short term, it's exciting because it may help us understand and treat neurological diseases. But in the long term, the possibility of leveraging human brains for computation, now that's a weird way to put it, but we have a lot of compute power in our brains.

We're actually doing a lot of computation, each one of us, every living moment of our lives. And the unfortunate thing is we're not able to share the outcome of that computation with the world; we share it through a very low bandwidth channel. So, not from an individual perspective but from the perspective of society, it's interesting to consider that if we can create a high bandwidth connection between a computer and a human brain, then we're able to leverage the computation the human brain already provides and add it to the global compute capacity available to the world.

That's a really interesting possibility. The way I put it is a little bit ineloquent, but oftentimes when you talk about brain-computer interfaces, the way, for example, Elon Musk talks about Neuralink, it's from an individual perspective of increasing your ability to communicate with the world and receive information from the world. But if you look from a society perspective, you're now able to leverage the computational power of human brains, either the idle cycles or the actual computation we already perform to survive in our daily lives, and add it to the global compute surface, the global capacity available in the world.

And the human brain is quite an incredible computing machine, so if you can connect into that and share that computation, I think incredible exponential growth can be achieved without significant innovation on the algorithm side. Now, a lot of the previous things we talked about was more on the hardware side or at least very low-level software side of exponential improvement.

I really like the recent paper from Danny Hernandez and others at OpenAI called "Measuring the Algorithmic Efficiency of Neural Networks" that looks at different domains of machine learning and deep learning and shows that the efficiency of the algorithms involved has increased exponentially, actually far outpacing the improvement of Moore's law.

So if we look at sort of the main one, starting from the ImageNet moment with AlexNet, a neural network on a computer vision task: if we look at AlexNet in 2012 and then EfficientNet in 2019, and all the networks that led up to it, the improvement is that it now takes 44 times less computation to train a neural network to the level of AlexNet.

If we look at Moore's law over the same span of time, it would only account for an 11 times decrease in the cost. And the paper also highlights the same kind of exponential improvements in natural language, even in reinforcement learning. So the open question this raises is whether, with deep learning, when we look at these learning methods, algorithmic progress may yield more gains than hardware efficiency improvements.
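
Here is the back-of-the-envelope arithmetic behind that comparison. The 44x figure is the one reported above from the paper; the two-year doubling period for Moore's law is the usual assumption.

```python
import math

years = 2019 - 2012                     # AlexNet (2012) to EfficientNet (2019)
moore_gain = 2 ** (years / 2)           # gain from transistor doubling alone
algo_gain = 44                          # reported reduction in training compute

print(round(moore_gain, 1))             # ~11.3x over the same seven years
# Implied doubling time of algorithmic efficiency, in months:
print(round(12 * years / math.log2(algo_gain), 1))   # roughly every ~15-16 months
```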

That's a really exciting possibility, especially for people working in the field, because it means human ingenuity will be essential for the continued exponential improvement of AI. All that said, whether AI will continue to improve exponentially is an open question. I wanna sort of place a flag down. I don't know, I change my mind every day on most things, but today I feel AI will continue to improve exponentially.

Now, exponential improvement is always just a stack of S-curves. It's not a single, smooth exponential; it's big breakthrough innovations stacked on top of each other, each leveling out before a new innovation comes along. So the other open question is, out of the candidates that we discussed, where will the S-curves that feed the exponential most likely come from?
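
As a purely illustrative sketch, with arbitrary parameters of my own choosing, summing a few logistic S-curves, each one a stand-in for a single breakthrough that eventually levels out, produces a trajectory that looks like sustained exponential growth:

```python
import numpy as np

def s_curve(t, midpoint, height):
    """A single logistic 'breakthrough' that saturates at the given height."""
    return height / (1.0 + np.exp(-(t - midpoint)))

t = np.linspace(0, 30, 7)
# Each successive breakthrough arrives later and contributes a bigger jump.
stacked = sum(s_curve(t, m, height=2 ** (m / 5)) for m in (5, 10, 15, 20, 25))
print(np.round(stacked, 1))   # grows roughly geometrically until the last curve saturates
```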

So for me, it's the innovation in algorithms and the innovation, in supervised learning, in how data is organized and leveraged in the learning process; that is, the efficiency of learning and search processes, especially with active learning. You know, there's a lot of terminology swimming around that's a little bit loose. So folks like Yann LeCun are really excited by self-supervised learning.

And you can think of it, you can define it however the heck you want, but you can think of self-supervised learning as leveraging human annotation very little, leveraging human expertise very little. So that's looking at mechanisms that are extremely powerful, like self-play in reinforcement learning, or, in a video computer vision context, the idea would be that you have an algorithm that just watches YouTube videos all day.

And from that it is able to figure out common sense reasoning, the physics of the world, and so on, in an unsupervised way, just by observing the world. Now, for me, I'm much more excited by active learning, which is the optimization of the way you select the data you learn from.

You say: I'm going to learn, I'm going to become increasingly efficient, I'm going to learn from smaller and smaller data sets, but I'm going to be extremely selective about which part of the data I look at and annotate or ask for human supervision over. I think a really simple but exciting example of that in the real world is what the Tesla Autopilot team is doing by creating this pipeline with a multitask learning framework, where there's a bunch of different tasks.

And there's a pipeline for discovering edge cases for each of the tasks. You keep feeding the discovered edge cases back and retraining the network over and over for each of the different tasks. And then there's a shared part of the network that keeps learning over time.

And so there's this active learning framework that just keeps looping over and over and gets better and better over time as it continually discovers and learns from the edge cases. I think that's a very simple example of what I'm talking about, but I'm really excited by that possibility. So: innovation in learning, in terms of its ability to discover just the right data to improve its own performance.
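
Here is a minimal, hypothetical sketch of that kind of loop, not Tesla's actual pipeline: each round, pick the unlabeled samples the current model is least confident about (the "edge cases"), get human labels for just those, retrain, and repeat. The `train` and `label_fn` callables, and every name below, are illustrative assumptions.

```python
import numpy as np

def select_edge_cases(model, unlabeled_x, budget):
    """Indices of the 'budget' samples the model is least confident about."""
    probs = model.predict_proba(unlabeled_x)     # (n_samples, n_classes)
    confidence = probs.max(axis=1)               # probability of the top class
    return np.argsort(confidence)[:budget]       # lowest confidence first

def active_learning_loop(train, unlabeled_x, label_fn, rounds=5, budget=100):
    """train(x, y) returns a model with .predict_proba; label_fn supplies human labels."""
    labeled_x, labeled_y, model = [], [], None
    for _ in range(rounds):
        if model is None:
            # Cold start: no model yet, so sample the first batch at random.
            idx = np.random.choice(len(unlabeled_x), budget, replace=False)
        else:
            idx = select_edge_cases(model, unlabeled_x, budget)
        labeled_x.append(unlabeled_x[idx])
        labeled_y.append(label_fn(unlabeled_x[idx]))          # the annotation step
        unlabeled_x = np.delete(unlabeled_x, idx, axis=0)     # shrink the pool
        model = train(np.concatenate(labeled_x), np.concatenate(labeled_y))
    return model
```

The key design choice, matching the description above, is that human annotation is spent only on the small, targeted set of uncertain samples, while the shared model keeps being retrained on the growing labeled set.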

And I believe the performance of active learning can increase exponentially in the coming years. Another source of S-curves that I'm really excited about, but that is very unpredictable, is the general expansion of the compute surfaces in the world. So it's unclear, but it's very possible, that the Internet of Things, IoT, eventually will come around, where there are smart devices just everywhere.

And we're not talking about an Alexa here or there. We're talking about just everything as a compute surface. I think that's a really exciting possibility for the future. It may be far away, but I certainly hope to be among the people who try to create some of that future.

So it's an exciting, out-there possibility. To me, the total game changer that we don't expect, the one that seems crazy, especially when Elon Musk talks about it in the context of Neuralink, is brain-computer interfaces. I think that's a really exciting technology for helping understand and treat neurological diseases. But if you can make it work to where a computer can communicate in a high bandwidth way with a brain, a two-way communication, that's going to change the nature of computation and the nature of artificial intelligence completely.

If an AI system can communicate with the human brain, with each leveraging the other's computation, I don't think we can even imagine the kind of world that that would create. That's a really exciting possibility, but at this time, it's shrouded in uncertainty. It seems impossible and crazy, but if anyone can do it, it's the folks working on brain-computer interfaces, and certainly folks like Elon Musk and the brilliant engineers working at Neuralink.

When you talk about exponential improvement in AI, the natural question that people ask is, "When is the singularity coming? Is it 2030, 2045, 2050, a century from now?" Again, I don't have firm beliefs on this, but from my perspective, I think we're living through the singularity. I think the exponential improvement that we've been a part of in artificial intelligence is sufficiently smooth that we don't even sense the madness of the curvature of the improvement we've been living through.

I think it's been just incredible, and every new stage we so quickly take for granted. I think we're living through the singularity, and I think we'll continue adapting incredibly well to the exponential improvement of AI. I can't wait to see what the future holds. So that was my simple attempt to discuss some of the ideas by Rich Sutton in his blog post, "The Bitter Lesson," and the broader context of exponential improvement in AI and the role of computation and algorithmic improvement in that exponential improvement.

This has been part of the AI Paper Club on our Discord server. You're welcome to join anytime. It's not just AI. It's people from all walks of life, all levels of expertise, from artists to musicians to neuroscientists to physicists. It's kind of an incredible community, and I really enjoyed being part of it.

It's, I think, something special, so join us anytime. If you have suggestions for papers we should cover, let me know. Otherwise, thanks for watching, and I'll see you next time.