
Exponential Progress of AI: Moore's Law, Bitter Lesson, and the Future of Computation


Chapters

0:00 Overview
0:37 Bitter Lesson by Rich Sutton
6:55 Contentions and opposing views
9:10 Is evolution a part of search, learning, or something else?
10:51 Bitter Lesson argument summary
11:42 Moore's Law
13:37 Global compute capacity
15:43 Massively parallel computation
16:41 GPUs and ASICs
17:17 Quantum computing and neuromorphic computing
19:25 Neuralink and brain-computer interfaces
21:28 Deep learning efficiency
22:57 Open questions for exponential improvement of AI
28:22 Conclusion

Whisper Transcript

00:00:00.000 | This video is looking at exponential progress
00:00:02.280 | for artificial intelligence from a historical perspective
00:00:05.120 | and anticipating possible future trajectories
00:00:07.640 | that may or may not lead to exponential progress of AI.
00:00:11.640 | At the center of this discussion is a blog post
00:00:14.800 | called "The Bitter Lesson" by Rich Sutton,
00:00:17.160 | which ties together several different concepts,
00:00:20.000 | specifically looking at the role of computation
00:00:23.080 | in the progress of artificial intelligence
00:00:25.200 | and computer science in general.
00:00:26.940 | This blog post and the broader discussion
00:00:29.040 | is part of the AI Paper Club on our Discord server.
00:00:32.540 | If you wanna join the discussion, everyone is welcome.
00:00:34.980 | Link is in the description.
00:00:37.260 | So I'd like to first discuss the argument made
00:00:39.340 | in "The Bitter Lesson" by Rich Sutton
00:00:41.020 | about the role of computation
00:00:43.900 | in the progress of artificial intelligence.
00:00:45.980 | And then I'd like to look into the future
00:00:47.580 | and see what are the possible ideas
00:00:50.000 | that will carry the flag of exponential improvement in AI,
00:00:55.000 | whether it is in computation
00:00:56.980 | with the continuation of Moore's law,
00:00:58.980 | or a bunch of other ideas in both hardware and software.
00:01:02.580 | So "The Bitter Lesson,"
00:01:04.520 | the basic argument contains several ideas.
00:01:07.380 | The central idea is that most of the improvement
00:01:10.080 | in artificial intelligence over the past 70 years
00:01:12.860 | has occurred due to the improvement of computation
00:01:15.580 | versus improvement in algorithms.
00:01:17.580 | And when I say improvement of computation,
00:01:19.460 | I mean Moore's law, transistor count,
00:01:22.060 | doubling every two years.
00:01:24.100 | And so it wasn't the innovation in the algorithms,
00:01:26.940 | but instead the same brute force algorithms
00:01:29.900 | that were sufficiently general
00:01:31.740 | and were effective at leveraging computation
00:01:34.020 | were the ones associated with successful progress of AI.
00:01:37.800 | So put another way,
00:01:39.980 | general methods that are automated
00:01:41.820 | and can leverage big compute
00:01:44.060 | are better than specialized, fine-tuned,
00:01:47.900 | human expertise injected methods that leverage small compute.
00:01:52.700 | And when I say small compute,
00:01:54.020 | I'm referring to any computational resources available today
00:01:58.260 | because with the exponential growth
00:02:00.620 | of computational resources over the past many decades
00:02:03.120 | with Moore's law,
00:02:04.380 | basically anything you have today
00:02:05.980 | is much smaller than anything you'll have tomorrow.
00:02:08.740 | That's how exponential growth works.
00:02:10.860 | And looking from yet another perspective
00:02:13.180 | of human expertise, human knowledge injection,
00:02:16.700 | AI that discovers a solution by itself
00:02:18.840 | is better than AI that encodes human expertise
00:02:21.980 | and human knowledge.
00:02:23.460 | Rich Sutton also in his blog post argues
00:02:25.860 | that the two categories of techniques
00:02:28.260 | that were most capable of leveraging
00:02:30.940 | massive amounts of computation
00:02:32.340 | are learning techniques and search techniques.
00:02:35.700 | Now, by way of example,
00:02:36.860 | you can think of search techniques
00:02:39.100 | as the ones IBM's Deep Blue used to beat
00:02:41.140 | Garry Kasparov in the game of chess.
00:02:44.900 | These are the brute force search techniques
00:02:47.100 | that were criticized at the time
00:02:48.420 | for their brute force nature.
00:02:51.220 | And the same, I would say, goes for
00:02:53.540 | the brute force learning techniques
00:02:57.020 | of Google DeepMind that beat
00:02:59.860 | the world champion at the game of Go.
00:03:02.180 | Now, the reason I call the self-play mechanism brute force
00:03:05.340 | is because the reinforcement learning methods of today
00:03:08.580 | are fundamentally wasteful in terms of
00:03:11.780 | how efficiently they learn.
00:03:13.140 | And that's the critical thing Rich Sutton argues
00:03:15.500 | about brute force methods:
00:03:17.100 | these methods are able to leverage computation.
00:03:19.960 | So they may not be efficient
00:03:21.420 | or they may not have the kind of cleverness
00:03:24.980 | that human expertise might provide,
00:03:28.180 | but they're able to leverage computation
00:03:30.060 | and therefore as computation exponentially grows,
00:03:33.460 | they're able to outperform everything else.
00:03:35.720 | And the blog post provides a few other examples:
00:03:37.740 | speech recognition that started with heuristics,
00:03:39.900 | then went to the statistical methods of HMMs,
00:03:42.860 | and finally, now, the recent big successes
00:03:45.180 | in speech recognition and natural language processing
00:03:47.500 | in general with neural networks.
00:03:49.660 | And the same in the computer vision world:
00:03:51.840 | the fine-tuned, human-expertise-driven feature selection
00:03:56.400 | of everything that led up to SIFT,
00:03:59.700 | and then finally the big ImageNet moment,
00:04:02.360 | which showed that neural networks are able to discover
00:04:04.820 | automatically the hierarchy of features required
00:04:07.600 | to successfully complete different computer vision tasks.
00:04:10.400 | I think this is a really thought-provoking blog post
00:04:14.280 | because it suggests that when we develop methods,
00:04:17.160 | whether it's in the software or the hardware,
00:04:19.160 | we should be thinking about long-term progress,
00:04:22.360 | the impact of our ideas, not for this year,
00:04:26.120 | but in five years, 10 years, 20 years from now.
00:04:28.840 | So when you look at the progress of the field
00:04:31.560 | from that perspective, there's certain things
00:04:34.040 | that are not gonna hold up.
00:04:35.440 | And Rich argues that actually the majority of things
00:04:39.600 | that we work on in the artificial intelligence community,
00:04:42.920 | especially in academic circles,
00:04:44.980 | are too focused on the injection of human expertise
00:04:49.320 | because that is how you're able to get
00:04:51.260 | incremental improvement that you can publish on
00:04:54.420 | and then sort of add publications to your resume,
00:04:58.460 | you have career success and progress,
00:05:00.780 | and you feel better because you're injecting
00:05:02.640 | your own expertise into the system
00:05:05.020 | as opposed to having these, quote-unquote,
00:05:07.180 | dumb brute force approaches.
00:05:09.020 | I think there is something, from a human psychology
00:05:13.020 | perspective, about brute force methods
00:05:15.940 | just not being associated with
00:05:18.460 | innovative, brilliant thinking.
00:05:20.780 | In fact, if you look at the brute force search
00:05:22.780 | or the brute force learning approaches,
00:05:25.300 | I think, whether at the time or looking at it today,
00:05:28.220 | the publications and the science associated
00:05:30.140 | with these methods did not get
00:05:32.500 | the recognition they deserve.
00:05:34.000 | They got a huge amount of recognition
00:05:35.660 | because of the publicity of the actual matches
00:05:37.680 | they were involved in, but the scientific community,
00:05:40.700 | I don't think, gave enough respect
00:05:43.460 | to the scientific contribution of these general methods.
00:05:46.180 | As an interesting, thought-provoking idea,
00:05:50.380 | I would love to see that when people publish papers today,
00:05:54.940 | they maybe almost have a section
00:05:57.140 | where they describe whether, if computation
00:05:59.860 | were able to be scaled by 10x or by 100x,
00:06:03.380 | looking five, 10 years down the future,
00:06:05.740 | the method would hold up to that scaling.
00:06:08.660 | Is it scalable?
00:06:09.500 | Is this method fundamentally scalable?
00:06:11.220 | I think that's a really good question to ask.
00:06:13.500 | Is this something that would benefit from,
00:06:15.720 | or at least scale linearly with, compute?
00:06:18.120 | That, to me, is a really interesting
00:06:19.700 | and provocative question that all graduate students
00:06:22.860 | and faculty and researchers should be asking themselves
00:06:25.500 | about the methods they propose.
00:06:27.680 | Overall, I think this blog post serves
00:06:30.060 | as a really good thought experiment
00:06:32.260 | because I think we often give a disproportionate
00:06:35.060 | amount of respect to algorithmic improvement
00:06:40.060 | and not enough respect, when we look at the big arc
00:06:43.380 | of progress in artificial intelligence, to computation,
00:06:46.580 | to the improvement of computation,
00:06:48.060 | whether that's talking about just the raw transistor count
00:06:52.020 | or other aspects of improving the computational process.
00:06:56.540 | If we look at this blog post as it is,
00:06:58.180 | you can, of course, raise some contentions
00:06:59.900 | and some opposing views.
00:07:02.620 | First, the blog post doesn't mention anything about data.
00:07:06.740 | And in terms of learning, if we look at the kind
00:07:10.060 | of learning that's been really successful
00:07:11.900 | for real-world applications, it's supervised learning,
00:07:15.220 | meaning it's learning that uses human annotation of data.
00:07:18.920 | And so the scalability of learning methods
00:07:23.820 | with computation also needs to be coupled
00:07:26.300 | with the scalability of being able to annotate data.
00:07:29.900 | And it's unclear to me how the scalability
00:07:33.580 | with computation naturally couples
00:07:37.180 | with the annotation of data.
00:07:39.500 | Now, I'll propose some ideas there later on.
00:07:41.780 | I think they're super exciting
00:07:43.040 | in the space of active learning,
00:07:44.620 | but in general, those two are not directly linked,
00:07:47.940 | at least in the argument of the blog post.
00:07:50.860 | So to be fair, the blog post is looking
00:07:52.760 | at the historical context of progress in AI.
00:07:56.380 | And so in that way, it's looking at methods
00:07:57.980 | that leverage the exponential improvement
00:08:00.900 | in raw computation power as observed by Moore's law.
00:08:04.700 | But of course, you can also generalize this blog post
00:08:08.260 | to say really any methods that hook
00:08:10.740 | onto any kind of exponential improvement.
00:08:12.940 | So computation improvement at any level of abstraction,
00:08:17.140 | including as we'll later talk about,
00:08:18.860 | at the highest level of abstraction of deep learning
00:08:22.140 | or even meta-learning.
00:08:23.580 | As long as these methods can hook
00:08:25.940 | onto the exponential improvement in these contexts,
00:08:29.100 | they're able to ride the wave of exponential improvement.
00:08:33.100 | It's just that the main exponential improvement
00:08:36.020 | we've seen in the past 70 years is that of Moore's law.
00:08:39.620 | Another contention that I personally don't find
00:08:41.940 | very convincing is when you say
00:08:44.220 | that learning or search methods
00:08:45.740 | don't require much human expertise,
00:08:47.520 | well, they kind of do.
00:08:48.940 | You still need to do some fine tuning.
00:08:50.980 | There's still a bunch of tricks
00:08:52.340 | even though it's at a higher level.
00:08:54.140 | And the reason I don't find that very convincing
00:08:56.020 | is because I think the amount
00:08:59.060 | and the quality of human expertise required
00:09:01.100 | for deep learning methods is just much smaller
00:09:04.340 | and much more directed
00:09:05.540 | than in classical machine learning methods,
00:09:08.180 | or especially in heuristic-based methods.
00:09:10.900 | Now, one big, I don't know if it's a contention,
00:09:13.220 | but it's an open question for me.
00:09:15.360 | It's often useful when we try to chase
00:09:18.420 | the creation of intelligent systems
00:09:19.820 | to think about the existence proof that we have before us,
00:09:23.200 | which is our own brain.
00:09:24.700 | And I think it's fair to say that the process
00:09:27.560 | that created the intelligence of our brain
00:09:29.380 | is the evolutionary process.
00:09:31.380 | Now, the question as it relates to the blog post, to me,
00:09:36.020 | is whether evolution falls under the category
00:09:39.300 | of search methods or of learning methods,
00:09:41.580 | or some mix of the two.
00:09:42.980 | Is it a subset, a combination of the two,
00:09:45.420 | or is it a superset?
00:09:46.980 | Or is it a completely different kind of thing?
00:09:49.840 | I think that's a really interesting
00:09:51.780 | and really difficult question for me.
00:09:53.180 | That I think about often.
00:09:54.660 | What is the evolutionary process
00:09:57.280 | in terms of our best performing methods of today?
00:10:01.780 | Of course, there's genetic algorithms,
00:10:03.620 | there's genetic programming.
00:10:04.660 | These are very kind of specialized,
00:10:06.540 | evolution-inspired methods.
00:10:09.700 | But the actual evolutionary process
00:10:11.860 | that created life on Earth,
00:10:13.500 | that created intelligent life on Earth,
00:10:15.340 | how does that relate to the search
00:10:17.420 | and the learning methods that leverage computation so well?
00:10:21.140 | It does seem from a 10,000 foot level
00:10:23.860 | that the evolutionary process,
00:10:25.700 | whether it relates to search or learning,
00:10:28.160 | is the kind of process
00:10:29.260 | that would leverage computation very well.
00:10:31.440 | In fact, from a human-centric perspective
00:10:34.900 | of a human that values his life,
00:10:37.600 | the evolutionary process seems to be very brute force,
00:10:40.220 | very wasteful.
00:10:41.780 | So in that way, perhaps it does have similarities
00:10:43.920 | to the brute force search
00:10:45.140 | and the brute force learning self-play mechanisms
00:10:48.660 | that we see so successfully leveraging computation.
00:10:51.820 | So to summarize the argument made in the bitter lesson,
00:10:54.840 | the exponential progress of AI over the past 60, 70 years
00:10:59.140 | was coupled to the exponential progress of computation
00:11:03.020 | with Moore's law and the doubling of transistors.
00:11:05.540 | And as we stand today, the open question then is,
00:11:08.620 | if we look at the possibility
00:11:10.240 | of future exponential improvement
00:11:11.660 | of artificial intelligence,
00:11:13.420 | will that be due to human ingenuity?
00:11:15.400 | So invention of new, better, clever algorithms,
00:11:18.740 | or will it be due to improvement,
00:11:21.980 | increase in raw computation power?
00:11:24.880 | Or I think a distinct option is both.
00:11:28.180 | I'll talk about my bets for this open question
00:11:30.880 | at the end of the video,
00:11:32.080 | but at this time,
00:11:34.100 | let's talk about some possible flag bearers
00:11:38.020 | of exponential improvement in AI
00:11:40.580 | in the coming years and decades.
00:11:43.100 | First, let's look at Moore's law,
00:11:45.520 | which is an observation, it's not a law.
00:11:48.080 | It has two meanings, I would say.
00:11:49.680 | One is the precise technical meaning or the actual meaning,
00:11:53.520 | which is the doubling of transistor count every two years.
00:11:57.000 | Or you can look at it from a financial perspective
00:11:59.240 | and look at dollars per FLOP,
00:12:01.280 | which decreases exponentially.
00:12:02.640 | This allows you to compare CPUs and GPUs
00:12:04.920 | and different kinds of processors together on the same plot.
00:12:07.840 | And the second meaning,
00:12:09.000 | I think that's very commonly used
00:12:11.080 | in general public discourse,
00:12:13.000 | is the general sense that there's an exponential improvement
00:12:16.640 | of computational capabilities.
00:12:18.240 | And I'm actually personally okay with that use of Moore's law
00:12:21.480 | as we generalize across different technologies
00:12:24.080 | and different ideas to use Moore's law
00:12:26.640 | to mean the general observation
00:12:29.360 | of the exponential improvement
00:12:30.760 | of computational capabilities.
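As a rough illustration of those two readings, here is a minimal sketch; the starting transistor count and cost per GFLOP below are hypothetical, chosen only to show the doubling arithmetic, not figures from the video.

```python
# Illustrative sketch of both readings of Moore's law as a doubling process.
# Starting values are hypothetical, chosen only to show the arithmetic.

def doublings(years, period_years=2.0):
    """Number of doublings over a span, assuming a fixed doubling period."""
    return years / period_years

# Reading 1: transistor count doubles roughly every two years.
transistors_start = 40e6                         # hypothetical starting count
transistors_after_20y = transistors_start * 2 ** doublings(20)

# Reading 2: dollars per FLOP fall by half on the same cadence, which is what
# lets CPUs, GPUs, and other processors sit on the same cost curve.
dollars_per_gflop_start = 1000.0                 # hypothetical starting cost
dollars_per_gflop_after_20y = dollars_per_gflop_start / 2 ** doublings(20)

print(f"{transistors_after_20y:.3g} transistors, "
      f"${dollars_per_gflop_after_20y:.2f} per GFLOP")
# About 4.1e10 transistors and roughly $0.98/GFLOP after ten doublings.
```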
00:12:32.900 | So the question that's been asked many times
00:12:35.140 | over the past several decades is: is Moore's law dead?
00:12:38.120 | I think the answer has two camps.
00:12:40.440 | The majority of the industry says yes.
00:12:43.840 | And then there's a few folks like Jim Keller, now at Intel.
00:12:47.280 | I did a podcast with him, I highly recommend it.
00:12:50.720 | He says no, because actually when we look at the size
00:12:54.120 | of transistors, we have not yet hit
00:12:56.080 | the theoretical physics limit
00:12:57.760 | of how small we can get with the transistors.
00:13:00.880 | Now it gets extremely difficult for many reasons
00:13:04.620 | to get a transistor that starts approaching the size
00:13:07.160 | of a single nanometer in terms of power,
00:13:10.640 | in terms of error correction,
00:13:12.840 | in terms of what's required for actual fabrication
00:13:15.720 | of something at that kind of scale.
00:13:18.040 | But the theoretical physics limit hasn't been reached
00:13:20.680 | so Moore's law can continue.
00:13:22.340 | But also if we look at the broader definition
00:13:26.040 | of just exponential improvement
00:13:27.920 | of computational capabilities,
00:13:29.820 | there's a lot of other candidates, flag bearers,
00:13:32.480 | as I mentioned, that could carry
00:13:34.400 | that exponential flag forward.
00:13:36.440 | And let's look at them now.
00:13:37.840 | One is the global compute capacity.
00:13:41.540 | Now this one is really interesting.
00:13:43.120 | And I actually had trouble finding good data
00:13:45.780 | to answer the very basic question.
00:13:47.360 | I don't think that data exists.
00:13:48.960 | The question being how much total compute capacity
00:13:53.220 | is there in the world today?
00:13:55.240 | And looking historically, how has it been increasing?
00:13:58.280 | There's a few kind of speculative studies.
00:14:00.320 | Some of them I cite here.
00:14:01.460 | They're really interesting,
00:14:02.800 | but I do wish there was a little bit more data.
00:14:04.960 | I'm actually really excited by the potential of this
00:14:09.240 | in a way that is completely unexpected
00:14:12.000 | potentially in the future.
00:14:13.360 | Now, what are we talking about?
00:14:14.880 | We're talking about the actual number
00:14:16.900 | of general compute capable devices in the world.
00:14:20.760 | One of the really powerful compute devices
00:14:22.920 | that appeared over the past 20 years is gaming consoles.
00:14:27.800 | The other one, I mean, past maybe 10 years
00:14:30.480 | is a smartphone devices.
00:14:32.580 | Now, if we look into the future,
00:14:34.240 | the possibility, first of all,
00:14:35.680 | smartphone devices growing exponentially,
00:14:38.200 | but also the compute surface across all types of devices.
00:14:42.840 | So if we think of internet of things,
00:14:44.880 | every object in our day-to-day life
00:14:46.600 | gaining computational capabilities
00:14:48.860 | means that that computation can then be leveraged
00:14:51.280 | in some distributed way.
00:14:52.920 | And then we can look at an entirely other dimension
00:14:55.880 | of devices that could explode exponentially
00:14:58.240 | in the near or the long-term future
00:15:00.920 | of virtual reality and augmented reality devices.
00:15:04.520 | So currently both of those types of devices
00:15:06.400 | are not really gaining ground,
00:15:08.120 | but it's very possible that in the future,
00:15:13.000 | a huge amount of computational resources
00:15:14.880 | becomes available for these virtual worlds,
00:15:17.760 | for augmented worlds.
00:15:19.800 | So I'm actually really excited by the potential things
00:15:22.640 | that we can't yet expect
00:15:24.600 | in terms of the exponential growth of actual devices,
00:15:27.720 | which are able to do computation.
00:15:29.880 | The exponential expansion of compute surfaces in our world.
00:15:34.880 | That's really interesting.
00:15:36.280 | That might force us to rethink the nature of computation,
00:15:39.460 | to push it more and more
00:15:40.800 | towards like distributed computation.
00:15:44.200 | So speaking of distributed computation,
00:15:46.240 | another possibility of exponential growth of AI
00:15:49.280 | is just massively parallel computation.
00:15:52.220 | So increasing CPUs, GPUs,
00:15:54.680 | stacking them on top of each other
00:15:56.920 | and increasing that stack exponentially.
00:16:00.160 | Now you run up against Amdahl's law
00:16:03.040 | and all kinds of challenges that mean
00:16:06.440 | that as you increase the number of processors,
00:16:10.880 | it becomes more and more difficult:
00:16:12.840 | there's a diminishing return
00:16:15.440 | in terms of the compute speedup you gain
00:16:17.240 | when you add more processors.
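For reference, a minimal sketch of that diminishing return using the standard Amdahl's law formula; the parallel fraction and processor counts below are illustrative, not figures from the video.

```python
def amdahl_speedup(n_processors, parallel_fraction):
    """Amdahl's law: the serial part of a workload caps the speedup,
    no matter how many processors you add."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 95% of the work parallelizable, the speedup saturates near 20x.
for n in (10, 100, 1000, 1_000_000):
    print(f"{n} processors -> {amdahl_speedup(n, 0.95):.1f}x")
```

Perfectly parallelizable algorithms are the case where the serial fraction goes to zero, which is what would let the speedup keep growing with the processor count.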
00:16:18.440 | Now, if we can overcome Amdahl's law,
00:16:20.360 | if we can successfully design algorithms
00:16:23.580 | that are perfectly parallelizable across thousands,
00:16:27.500 | maybe millions, maybe billions of processors,
00:16:30.580 | then that changes the game.
00:16:32.580 | That changes the game
00:16:33.420 | and allows us to exponentially improve the AI algorithms
00:16:37.620 | by exponentially increasing
00:16:39.740 | the number of processors involved.
00:16:42.740 | Another dimension of approaches
00:16:44.580 | that contribute to exponential growth of AI
00:16:47.420 | is devices that are at their core parallelizable.
00:16:51.980 | More general devices like the GPUs,
00:16:54.860 | graphics processing units,
00:16:56.300 | or ones that are actually specific to neural networks
00:16:59.660 | or whatever the algorithm is,
00:17:00.980 | which are ASICs, application-specific integrated circuits.
00:17:04.620 | The TPU by Google being an excellent example of that
00:17:07.140 | where there's a bunch of hardware design decisions made
00:17:09.660 | that are specialized to machine learning,
00:17:11.180 | allowing it to be much more efficient
00:17:13.100 | in terms of both energy use
00:17:14.500 | and the actual performance of the algorithm.
00:17:18.300 | Now, another big space
00:17:19.820 | that I could probably divide into many slides
00:17:22.300 | of flag bearers for exponential AI growth
00:17:25.640 | is changing the actual nature of computation.
00:17:30.020 | So a completely different kind of computation.
00:17:33.020 | So two exciting candidates shown here,
00:17:35.020 | one is quantum computing
00:17:36.620 | and the other is neuromorphic computing.
00:17:38.900 | You're probably familiar with quantum computers
00:17:41.140 | and qubits: whereas classical computers
00:17:43.860 | only represent zeros and ones,
00:17:45.540 | qubits can also represent
00:17:47.660 | superpositions of zero and one.
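For reference, the standard textbook way to write that superposition (not from the video) is:

```latex
% A single-qubit state: a weighted superposition of the basis states |0> and |1>,
% with complex amplitudes whose squared magnitudes sum to one.
\[
  |\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle,
  \qquad |\alpha|^2 + |\beta|^2 = 1 .
\]
```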
00:17:50.020 | So there is a lot of excitement and development
00:17:52.740 | in this space, but it's, I would say, very early days,
00:17:56.780 | especially considering general methods
00:18:00.300 | that are able to leverage computation.
00:18:02.140 | First, it's really hard to build large quantum computers,
00:18:06.980 | but even if you can, second,
00:18:09.020 | it's very hard to build algorithms
00:18:10.820 | that significantly outperform the algorithms
00:18:13.500 | on classical computers,
00:18:15.100 | especially in the space of artificial intelligence
00:18:17.380 | with machine learning.
00:18:19.060 | Then there's another space of computing
00:18:21.000 | called neuromorphic computing
00:18:22.700 | that draws a lot of inspiration,
00:18:24.620 | a lot more inspiration from the human brain.
00:18:28.480 | Specifically, it models spiking neural networks.
00:18:31.540 | Now, the idea, I think, with neuromorphic computing
00:18:34.500 | is it's able to perform computation
00:18:36.620 | in a much more efficient way.
00:18:38.300 | One of the characteristic things about the human brain
00:18:41.340 | versus our computers today
00:18:42.700 | is that it's much more energy efficient
00:18:45.820 | for the same amount of computation.
00:18:48.720 | So neuromorphic computing is trying to achieve
00:18:51.460 | the same kind of performance.
00:18:53.180 | Again, very early days,
00:18:55.340 | and it's unclear how you can design general algorithms
00:18:59.300 | that reach even close to the same performance
00:19:02.420 | of machine learning algorithms, for example,
00:19:04.940 | run on classical computers of today with GPUs or ASICs.
00:19:08.620 | But of course, if you want to have a complete shift,
00:19:13.140 | like a phase shift in terms of the way
00:19:15.240 | we approach computation and artificial intelligence,
00:19:17.880 | a computer which functions in a completely different way
00:19:21.960 | than our classical computers is something
00:19:24.540 | that might be able to achieve that kind of phase shift.
00:19:27.180 | Now, another really exciting space of methodologies
00:19:30.220 | is brain-computer interfaces.
00:19:32.000 | In the short term, it's exciting
00:19:33.580 | because it may help us understand
00:19:36.460 | and treat neurological diseases.
00:19:38.740 | But in the long term,
00:19:40.520 | the possibility of leveraging human brains
00:19:44.780 | for computation, now that's a weird way to put it,
00:19:48.140 | but we have a lot of compute power in our brains.
00:19:50.900 | We're actually doing a lot of computation, each one of us,
00:19:54.540 | every living moment of our lives.
00:19:57.360 | And the unfortunate thing is we're not able
00:20:00.340 | to share the outcome of that computation with the world.
00:20:04.940 | We share it with a very low bandwidth channel.
00:20:08.020 | So not from an individual perspective,
00:20:11.380 | but from a perspective of society,
00:20:13.020 | it's interesting to consider if we can create
00:20:15.420 | a high bandwidth connection between a computer
00:20:17.220 | and a human brain, then we're able to leverage
00:20:21.100 | the computation the human brain already provides
00:20:24.000 | to be able to add to the global compute capacity
00:20:28.860 | available to the world.
00:20:30.820 | That's a really interesting possibility.
00:20:33.080 | The way I put it is a little bit ineloquent,
00:20:36.140 | but I think oftentimes when you talk
00:20:38.380 | about brain-computer interfaces, the way, for example,
00:20:41.100 | Elon Musk talks about Neuralink, it's often talked about
00:20:44.420 | from an individual perspective of increasing your ability
00:20:48.220 | to communicate with the world and receive information
00:20:50.260 | from the world, but if you look from a society perspective,
00:20:53.460 | you're now able to leverage the computational power
00:20:56.740 | of human brains, either the empty cycles
00:20:59.420 | or just the actual computation we'll perform
00:21:01.620 | to survive in our daily lives, able to leverage that
00:21:05.460 | to add to the global compute surface,
00:21:09.660 | the global capacity available in the world.
00:21:12.100 | And the human brain is quite an incredible
00:21:14.780 | computing machine, so if you can connect into that
00:21:18.340 | and share that computation, I think incredible
00:21:23.340 | exponential growth can be achieved
00:21:26.420 | without significant innovation on the algorithm side.
00:21:29.380 | Now, a lot of the previous things we talked about
00:21:31.400 | were more on the hardware side, or at least
00:21:33.860 | very low-level software side of exponential improvement.
00:21:37.300 | I really like the recent paper from Danny Hernandez
00:21:41.180 | and others at OpenAI called "Measuring the Algorithmic
00:21:43.820 | Efficiency of Neural Networks" that looks at different
00:21:46.300 | kind of domains of machine learning and deep learning
00:21:48.800 | and shows that the efficiency of the algorithms involved
00:21:51.620 | has increased exponentially, actually far outpacing
00:21:54.540 | the improvement of Moore's law.
00:21:55.620 | So if we look at sort of the main one,
00:21:58.540 | starting from the ImageNet moment with AlexNet,
00:22:01.500 | a neural network on a computer vision task:
00:22:03.140 | if we look at AlexNet in 2012 and then EfficientNet in 2019
00:22:08.140 | and all the networks that led up to it,
00:22:11.140 | the improvement is that it takes 44 times less computation
00:22:16.140 | to train a neural network to the level of AlexNet.
00:22:19.020 | So if we look at Moore's law over the same kind of
00:22:22.340 | span of time, Moore's law would only account for
00:22:25.700 | an 11 times decrease in the cost.
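A quick back-of-the-envelope check of those two figures, assuming 2012 to 2019 is roughly a seven-year span:

```python
import math

years = 7                                     # roughly 2012 (AlexNet) to 2019

# Moore's law reading: one doubling every two years over that span.
moore_factor = 2 ** (years / 2)
print(f"Moore's law over {years} years: ~{moore_factor:.1f}x")      # ~11.3x

# Reported algorithmic-efficiency gain: 44x less training compute.
efficiency_factor = 44
doubling_months = 12 * years / math.log2(efficiency_factor)
print(f"Efficiency doubling time: ~{doubling_months:.0f} months")   # ~15-16 months
```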
00:22:29.540 | And the paper also highlights the same kind of
00:22:32.180 | exponential improvements in natural language,
00:22:34.460 | even in reinforcement learning.
00:22:36.220 | So the open question this raises is whether, with deep learning,
00:22:40.140 | when we look at these learning methods,
00:22:42.060 | algorithmic progress may yield more gains
00:22:45.500 | than hardware efficiency improvements.
00:22:48.140 | That's a really exciting possibility,
00:22:49.620 | especially for people working in the field,
00:22:51.500 | because that means human ingenuity will be essential
00:22:55.040 | for the continued exponential improvement of AI.
00:22:58.700 | All that said, whether AI will continue to improve
00:23:02.060 | exponentially is an open question.
00:23:04.220 | I wanna sort of place a flag down.
00:23:06.460 | I don't know, I change my mind every day on most things,
00:23:08.540 | but today I feel AI will continue to improve exponentially.
00:23:12.340 | Now, exponential improvement is always
00:23:14.340 | just a stack of S-curves.
00:23:15.820 | It's not a single sort of nice exponential improvement.
00:23:19.140 | It's always kind of big breakthrough innovations
00:23:21.700 | stacked on top of each other that level out,
00:23:24.900 | and then a new innovation comes along.
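A small sketch of that picture, with hypothetical logistic S-curves whose ceilings grow so that their sum traces a roughly exponential envelope:

```python
import math

def s_curve(t, midpoint, ceiling):
    """One logistic S-curve: slow start, rapid rise, then leveling out."""
    return ceiling / (1.0 + math.exp(-(t - midpoint)))

def stacked(t, curves=((5, 10), (15, 100), (25, 1000))):
    # Each hypothetical innovation levels out, and the next one, with a higher
    # ceiling, takes over; the running sum looks roughly exponential.
    return sum(s_curve(t, midpoint, ceiling) for midpoint, ceiling in curves)

for t in range(0, 31, 5):
    print(f"t={t:2d}  capability ~ {stacked(t):8.1f}")
```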
00:23:27.980 | So the other open question is,
00:23:30.180 | where will the S-curves that feed the exponential
00:23:32.420 | come from most likely,
00:23:33.620 | out of the candidates that we discussed?
00:23:35.700 | So for me, it's the innovation in algorithms
00:23:38.340 | and the innovation in supervised learning,
00:23:40.540 | in how data is organized and leveraged
00:23:44.780 | in that learning process;
00:23:45.860 | so, the efficiency of learning and search processes,
00:23:48.820 | especially with active learning.
00:23:50.740 | You know, there's a lot of terminology swimming around
00:23:52.860 | that's a little bit loose.
00:23:54.180 | So folks like Yann LeCun are really excited
00:23:56.780 | by self-supervised learning.
00:23:58.740 | And you can think of it,
00:23:59.980 | you can define it however the heck you want,
00:24:01.620 | but you can think of self-supervised learning
00:24:03.940 | as leveraging human annotation very little,
00:24:07.140 | leveraging human expertise very little.
00:24:09.980 | So that's looking at mechanisms that are extremely powerful
00:24:12.500 | like self-play and reinforcement learning,
00:24:14.860 | or in a video computer vision context,
00:24:17.220 | you have the idea would be that you would have an algorithm
00:24:20.740 | just watches YouTube videos all day.
00:24:22.820 | And from that is able to figure out
00:24:24.620 | the common sense reasoning,
00:24:26.100 | the physics of the world and so on
00:24:28.100 | in an unsupervised way, just by observing the world.
00:24:31.580 | Now, for me, I'm excited by active learning much more,
00:24:34.740 | which is the optimization of the way you select the data
00:24:38.420 | from which you learn from.
00:24:39.620 | that you learn from.
00:24:40.740 | I'm going to become increasingly efficient.
00:24:42.700 | I'm going to learn from smaller and smaller data sets,
00:24:44.820 | but I'm going to be extremely selective
00:24:46.980 | about which part of the data I look at
00:24:49.300 | and annotate or ask human supervision over.
00:24:53.260 | I think a really simple, but exciting example
00:24:55.900 | of that in the real world is what the Tesla Autopilot team
00:24:58.220 | is doing by creating this pipeline
00:25:00.660 | where there's a multitask learning framework,
00:25:04.180 | where there's a bunch of different tasks.
00:25:06.140 | And there's a pipeline for discovering edge cases
00:25:09.100 | for each of the tasks.
00:25:10.060 | And you keep feeding the discovered edge cases back
00:25:15.460 | and retraining the network over and over
00:25:17.060 | for each of the different tasks.
00:25:18.620 | And then there's a shared part of the network
00:25:21.020 | that keeps learning over time.
00:25:22.180 | And so there's this active learning framework
00:25:24.900 | that just keeps looping over and over
00:25:26.260 | and gets better and better over time
00:25:27.660 | as it continually discovers and learns from the edge cases.
00:25:32.660 | I think that's a very simple example
00:25:34.300 | of what I'm talking about,
00:25:35.140 | but I'm really excited by that possibility.
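A minimal sketch of that kind of loop, under assumed placeholder pieces; the score, annotate, and retrain callables are hypothetical stand-ins, and this is not Tesla's actual pipeline.

```python
# Hypothetical active-learning loop: repeatedly mine the unlabeled pool for
# likely edge cases, ask for labels on just those, and retrain a shared model.

def active_learning_loop(model, unlabeled, score, annotate, retrain,
                         rounds=5, budget_per_round=100):
    labeled = []
    for _ in range(rounds):
        # Rank remaining examples by how uncertain or surprising they are to
        # the current model; keep only the top "edge cases" this round.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        edge_cases, unlabeled = ranked[:budget_per_round], ranked[budget_per_round:]
        # Ask for human annotation only on the selected edge cases.
        labeled.extend((x, annotate(x)) for x in edge_cases)
        # Retrain (or fine-tune) the shared model on everything labeled so far.
        model = retrain(model, labeled)
    return model
```

Here score, annotate, and retrain stand in for whatever uncertainty measure, labeling step, and training procedure a real pipeline would use; the point is only that annotation effort is spent where the model is weakest.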
00:25:37.500 | So, innovation in learning, in terms of its ability
00:25:40.060 | to discover just the right data to improve its performance.
00:25:44.900 | And I believe the performance of active learning
00:25:46.860 | can increase exponentially in the coming years.
00:25:49.620 | Another source of S-curves that I'm really excited about,
00:25:53.460 | but it's very unpredictable,
00:25:55.500 | is the general expansion of the compute surfaces
00:25:58.340 | in the world.
00:25:59.460 | So it's unclear, but it's very possible
00:26:03.820 | that the Internet of Things, IoT,
00:26:07.540 | eventually will come around
00:26:09.460 | where there's smart devices just everywhere.
00:26:12.140 | And we're not talking about Alexa here or there.
00:26:14.480 | We're talking about just everything as a compute surface.
00:26:19.340 | I think that's a really exciting possibility of the future.
00:26:21.980 | It may be far away,
00:26:23.300 | but I certainly hope to be part of the people
00:26:25.460 | that try to create some of that future.
00:26:29.040 | So it's an exciting out there possibility.
00:26:32.100 | To me, the total game changer that we don't expect
00:26:35.140 | that seems crazy,
00:26:37.020 | especially when Elon Musk talks about it
00:26:38.820 | in the context of Neuralink, is brain-computer interfaces.
00:26:41.960 | I think that's a really exciting technology
00:26:44.540 | for helping understand and treat neurological diseases.
00:26:49.020 | But if you can make it work
00:26:50.940 | to where a computer can communicate
00:26:52.580 | in a high bandwidth way with a brain,
00:26:54.940 | a two-way communication,
00:26:56.780 | that's going to change the nature of computation
00:26:59.480 | and the nature of artificial intelligence completely.
00:27:02.380 | If an AI system can communicate with the human brain
00:27:05.180 | with each leveraging the other's computation,
00:27:07.860 | I don't think we can even imagine
00:27:11.420 | the kind of world that that would create.
00:27:13.820 | That's a really exciting possibility,
00:27:16.760 | but at this time, it's shrouded in uncertainty.
00:27:21.160 | It seems impossible and crazy,
00:27:23.840 | but if anyone can do it,
00:27:24.920 | it's the folks working on brain-computer interfaces
00:27:27.040 | and certainly folks like Elon Musk
00:27:28.840 | and the brilliant engineers working at Neuralink.
00:27:32.920 | When you talk about exponential improvement in AI,
00:27:35.980 | the natural question that people ask is,
00:27:37.880 | "When is the singularity coming?
00:27:39.480 | "Is it 2030, 2045, 2050, a century from now?"
00:27:44.960 | Again, I don't have firm beliefs on this,
00:27:48.320 | but from my perspective,
00:27:49.920 | I think we're living through the singularity.
00:27:52.520 | I think the smoothness of the exponential improvement
00:27:55.960 | that we've been a part of in artificial intelligence
00:27:59.600 | is sufficiently smooth to where we don't even sense
00:28:02.400 | the madness of the curvature of the improvement
00:28:05.840 | that we've been living through.
00:28:06.800 | I think it's been just incredible,
00:28:08.860 | and every new stage we just so quickly take for granted.
00:28:13.400 | I think we're living through the singularity,
00:28:15.120 | and I think we'll continue adapting incredibly well
00:28:19.360 | to the exponential improvement of AI.
00:28:22.040 | I can't wait to see what the future holds.
00:28:24.320 | So that was my simple attempt to discuss some of the ideas
00:28:27.240 | by Rich Sutton in his blog post, "The Bitter Lesson,"
00:28:30.720 | and the broader context of exponential improvement in AI
00:28:33.720 | and the role of computation and algorithmic improvement
00:28:36.240 | in that exponential improvement.
00:28:38.440 | This has been part of the AI Paper Club
00:28:41.520 | on our Discord server.
00:28:43.000 | You're welcome to join anytime.
00:28:44.400 | It's not just AI.
00:28:45.320 | It's people from all walks of life, all levels of expertise,
00:28:49.240 | from artists to musicians to neuroscientists to physicists.
00:28:53.680 | It's kind of an incredible community,
00:28:55.040 | and I really enjoyed being part of it.
00:28:56.800 | It's, I think, something special, so join us anytime.
00:29:01.440 | If you have suggestions for papers we should cover,
00:29:03.520 | let me know.
00:29:04.680 | Otherwise, thanks for watching, and I'll see you next time.