
Exponential Progress of AI: Moore's Law, Bitter Lesson, and the Future of Computation


Chapters

0:00 Overview
0:37 Bitter Lesson by Rich Sutton
6:55 Contentions and opposing views
9:10 Is evolution a part of search, learning, or something else?
10:51 Bitter Lesson argument summary
11:42 Moore's Law
13:37 Global compute capacity
15:43 Massively parallel computation
16:41 GPUs and ASICs
17:17 Quantum computing and neuromorphic computing
19:25 Neuralink and brain-computer interfaces
21:28 Deep learning efficiency
22:57 Open questions for exponential improvement of AI
28:22 Conclusion

Whisper Transcript

00:00:00.000 | This video is looking at exponential progress
00:00:02.280 | for artificial intelligence from a historical perspective
00:00:05.120 | and anticipating possible future trajectories
00:00:07.640 | that may or may not lead to exponential progress of AI.
00:00:11.640 | At the center of this discussion is a blog post
00:00:14.800 | called "The Bitter Lesson" by Rich Sutton,
00:00:17.160 | which ties together several different concepts,
00:00:20.000 | specifically looking at the role of computation
00:00:23.080 | in the progress of artificial intelligence
00:00:25.200 | and computer science in general.
00:00:26.940 | This blog post and the broader discussion
00:00:29.040 | is part of the AI Paper Club on our Discord server.
00:00:32.540 | If you wanna join the discussion, everyone is welcome.
00:00:34.980 | Link is in the description.
00:00:37.260 | So I'd like to first discuss the argument made
00:00:39.340 | in "The Bitter Lesson" by Rich Sutton
00:00:41.020 | about the role of computation
00:00:43.900 | in the progress of artificial intelligence.
00:00:45.980 | And then I'd like to look into the future
00:00:47.580 | and see what are the possible ideas
00:00:50.000 | that will carry the flag of exponential improvement in AI,
00:00:55.000 | whether it is in computation
00:00:56.980 | with the continuation of Moore's law,
00:00:58.980 | or a bunch of other ideas in both hardware and software.
00:01:02.580 | So "The Bitter Lesson,"
00:01:04.520 | the basic argument contains several ideas.
00:01:07.380 | The central idea is that most of the improvement
00:01:10.080 | in artificial intelligence over the past 70 years
00:01:12.860 | has occurred due to the improvement of computation
00:01:15.580 | versus improvement in algorithms.
00:01:17.580 | And when I say improvement of computation,
00:01:19.460 | I mean Moore's law, transistor count,
00:01:22.060 | doubling every two years.
00:01:24.100 | And so it wasn't the innovation in the algorithms,
00:01:26.940 | but instead the same brute force algorithms
00:01:29.900 | that were sufficiently general
00:01:31.740 | and were effective at leveraging computation
00:01:34.020 | were the ones associated with successful progress of AI.
00:01:37.800 | So put another way,
00:01:39.980 | general methods that are automated
00:01:41.820 | and can leverage big compute
00:01:44.060 | are better than specialized, fine-tuned,
00:01:47.900 | human expertise injected methods that leverage small compute.
00:01:52.700 | And when I say small compute,
00:01:54.020 | I'm referring to any computational resources available today
00:01:58.260 | because with the exponential growth
00:02:00.620 | of computational resources over the past many decades
00:02:03.120 | with Moore's law,
00:02:04.380 | basically anything you have today
00:02:05.980 | is much smaller than anything you'll have tomorrow.
00:02:08.740 | That's how exponential growth works.
00:02:10.860 | And looking from yet another perspective
00:02:13.180 | of human expertise, human knowledge injection,
00:02:16.700 | AI that discovers a solution by itself
00:02:18.840 | is better than AI that encodes human expertise
00:02:21.980 | and human knowledge.
00:02:23.460 | Rich Sutton also in his blog post argues
00:02:25.860 | that the two categories of techniques
00:02:28.260 | that were most capable of leveraging
00:02:30.940 | massive amounts of computation
00:02:32.340 | are learning techniques and search techniques.
00:02:35.700 | Now, by way of example,
00:02:36.860 | you can think of search techniques
00:02:39.100 | as the ones IBM's Deep Blue used to beat
00:02:41.140 | Garry Kasparov in the game of chess.
00:02:44.900 | These are the brute force search techniques
00:02:47.100 | that were criticized at the time
00:02:48.420 | for their brute force nature.
00:02:51.220 | And the same, I would say, goes for
00:02:53.540 | the brute force learning techniques
00:02:57.020 | of Google DeepMind that beat
00:02:59.860 | the world champion at the game of Go.
00:03:02.180 | Now, the reason I call the self-play mechanism brute force
00:03:05.340 | is because the reinforcement learning methods of today
00:03:08.580 | are fundamentally wasteful in terms of
00:03:11.780 | how efficiently they learn.
00:03:13.140 | And that's the critical thing Rich Sutton argues
00:03:15.500 | about brute force methods:
00:03:17.100 | these methods are able to leverage computation.
00:03:19.960 | So they may not be efficient
00:03:21.420 | or they may not have the kind of cleverness
00:03:24.980 | that human expertise might provide,
00:03:28.180 | but they're able to leverage computation
00:03:30.060 | and therefore as computation exponentially grows,
00:03:33.460 | they're able to outperform everything else.
00:03:35.720 | And the blog post provides a few other examples:
00:03:37.740 | speech recognition that started with heuristics,
00:03:39.900 | then went to the statistical methods of HMMs,
00:03:42.860 | and finally, now, the recent big successes
00:03:45.180 | in speech recognition and natural language processing
00:03:47.500 | in general with neural networks.
00:03:49.660 | And the same in the computer vision world:
00:03:51.840 | the fine-tuned, human-expertise-driven feature selection
00:03:56.400 | of everything that led up to SIFT,
00:03:59.700 | and then finally the big ImageNet moment,
00:04:02.360 | which showed that neural networks are able to discover
00:04:04.820 | automatically the hierarchy of features required
00:04:07.600 | to successfully complete different computer vision tasks.
00:04:10.400 | I think this is a really thought-provoking blog post
00:04:14.280 | because it suggests that when we develop methods,
00:04:17.160 | whether it's in the software or the hardware,
00:04:19.160 | we should be thinking about long-term progress,
00:04:22.360 | the impact of our ideas, not for this year,
00:04:26.120 | but in five years, 10 years, 20 years from now.
00:04:28.840 | So when you look at the progress of the field
00:04:31.560 | from that perspective, there's certain things
00:04:34.040 | that are not gonna hold up.
00:04:35.440 | And Rich argues that actually the majority of things
00:04:39.600 | that we work on in the artificial intelligence community,
00:04:42.920 | especially in academic circles,
00:04:44.980 | are too focused on the injection of human expertise
00:04:49.320 | because that is how you're able to get
00:04:51.260 | incremental improvement that you can publish on
00:04:54.420 | and then sort of add publications to your resume,
00:04:58.460 | you have career success and progress,
00:05:00.780 | and you feel better because you're injecting
00:05:02.640 | your own expertise into the system
00:05:05.020 | as opposed to having these, quote-unquote,
00:05:07.180 | dumb brute force approaches.
00:05:09.020 | I think there is something, from a human psychology
00:05:13.020 | perspective, about brute force methods
00:05:15.940 | just not being associated with
00:05:18.460 | innovative, brilliant thinking.
00:05:20.780 | In fact, if you look at the brute force search
00:05:22.780 | or the brute force learning approaches,
00:05:25.300 | I think, whether at the time or looking at it today,
00:05:28.220 | the publications and the science associated
00:05:30.140 | with these methods did not get
00:05:32.500 | the recognition they deserve.
00:05:34.000 | They got a huge amount of recognition
00:05:35.660 | because of the publicity of the actual matches
00:05:37.680 | they were involved in, but the scientific community,
00:05:40.700 | I don't think, gave enough respect
00:05:43.460 | to the scientific contribution of these general methods.
00:05:46.180 | As an interesting, thought-provoking idea,
00:05:50.380 | I would love to see that when people publish papers today,
00:05:54.940 | they maybe almost have a section
00:05:57.140 | where they describe whether, if computation
00:05:59.860 | were able to be scaled by 10x or by 100x,
00:06:03.380 | looking five, 10 years down the future,
00:06:05.740 | the method would hold up to that scaling.
00:06:08.660 | Is it scalable?
00:06:09.500 | Is this method fundamentally scalable?
00:06:11.220 | I think that's a really good question to ask.
00:06:13.500 | Is this something that would benefit from,
00:06:15.720 | or at least scale linearly with, compute?
00:06:18.120 | That, to me, is a really interesting
00:06:19.700 | and provocative question that all graduate students
00:06:22.860 | and faculty and researchers should be asking themselves
00:06:25.500 | about the methods they propose.
00:06:27.680 | Overall, I think this blog post serves
00:06:30.060 | as a really good thought experiment
00:06:32.260 | because I think we often give a disproportionate
00:06:35.060 | amount of respect to algorithmic improvement
00:06:40.060 | and not enough respect, when we look at the big arc
00:06:43.380 | of progress in artificial intelligence, to computation,
00:06:46.580 | to the improvement of computation,
00:06:48.060 | whether that's talking about just the raw transistor count
00:06:52.020 | or other aspects of improving the computational process.
00:06:56.540 | If we look at this blog post as it is,
00:06:58.180 | you can, of course, raise some contentions
00:06:59.900 | and some opposing views.
00:07:02.620 | First, the blog post doesn't mention anything about data.
00:07:06.740 | And in terms of learning, if we look at the kind
00:07:10.060 | of learning that's been really successful
00:07:11.900 | for real-world applications, it's supervised learning,
00:07:15.220 | meaning it's learning that uses human annotation of data.
00:07:18.920 | And so the scalability of learning methods
00:07:23.820 | with computation also needs to be coupled
00:07:26.300 | with the scalability of being able to annotate data.
00:07:29.900 | And it's unclear to me how the scalability
00:07:33.580 | with computation naturally couples
00:07:37.180 | with the annotation of data.
00:07:39.500 | Now, I'll propose some ideas there later on.
00:07:41.780 | I think they're super exciting
00:07:43.040 | in the space of active learning,
00:07:44.620 | but in general, those two are not directly linked,
00:07:47.940 | at least in the argument of the blog post.
00:07:50.860 | So to be fair, the blog post is looking
00:07:52.760 | at the historical context of progress in AI.
00:07:56.380 | And so in that way, it's looking at methods
00:07:57.980 | that leverage the exponential improvement
00:08:00.900 | in raw computation power as observed by Moore's law.
00:08:04.700 | But of course, you can also generalize this blog post
00:08:08.260 | to say really any methods that hook
00:08:10.740 | onto any kind of exponential improvement.
00:08:12.940 | So computation improvement at any level of abstraction,
00:08:17.140 | including as we'll later talk about,
00:08:18.860 | at the highest level of abstraction of deep learning
00:08:22.140 | or even meta-learning.
00:08:23.580 | As long as these methods can hook
00:08:25.940 | onto the exponential improvement in these contexts,
00:08:29.100 | they're able to ride the wave of exponential improvement.
00:08:33.100 | It's just that the main exponential improvement
00:08:36.020 | we've seen in the past 70 years is that of Moore's law.
00:08:39.620 | Another contention that I personally don't find
00:08:41.940 | very convincing is when you say
00:08:44.220 | that learning or search methods
00:08:45.740 | don't require much human expertise,
00:08:47.520 | well, they kind of do.
00:08:48.940 | You still need to do some fine tuning.
00:08:50.980 | There's still a bunch of tricks
00:08:52.340 | even though it's at a higher level.
00:08:54.140 | And the reason I don't find that very convincing
00:08:56.020 | is because I think the amount
00:08:59.060 | and the quality of human expertise required
00:09:01.100 | for deep learning methods is just much smaller
00:09:04.340 | and much more directed
00:09:05.540 | than in classical machine learning methods,
00:09:08.180 | or especially in heuristic-based methods.
00:09:10.900 | Now, one big, I don't know if it's a contention,
00:09:13.220 | but it's an open question for me.
00:09:15.360 | It's often useful when we try to chase
00:09:18.420 | the creation of intelligent systems
00:09:19.820 | to think about the existence proof that we have before us,
00:09:23.200 | which is our own brain.
00:09:24.700 | And I think it's fair to say that the process
00:09:27.560 | that created the intelligence of our brain
00:09:29.380 | is the evolutionary process.
00:09:31.380 | Now, the question as it relates to the blog post, to me,
00:09:36.020 | is whether evolution falls under the category
00:09:39.300 | of search methods or of learning methods,
00:09:41.580 | or some mix of the two.
00:09:42.980 | Is it a subset, a combination of the two,
00:09:45.420 | or is it a superset?
00:09:46.980 | Or is it a completely different kind of thing?
00:09:49.840 | I think that's a really interesting
00:09:51.780 | and really difficult question for me.
00:09:53.180 | That I think about often.
00:09:54.660 | What is the evolutionary process
00:09:57.280 | in terms of our best performing methods of today?
00:10:01.780 | Of course, there's genetic algorithms,
00:10:03.620 | there's genetic programming.
00:10:04.660 | These are very kind of specialized,
00:10:06.540 | evolution-inspired methods.
00:10:09.700 | But the actual evolutionary process
00:10:11.860 | that created life on Earth,
00:10:13.500 | that created intelligent life on Earth,
00:10:15.340 | how does that relate to the search
00:10:17.420 | and the learning methods that leverage computation so well?
00:10:21.140 | It does seem from a 10,000 foot level
00:10:23.860 | that the evolutionary process,
00:10:25.700 | whether it relates to search or learning,
00:10:28.160 | is the kind of process
00:10:29.260 | that would leverage computation very well.
00:10:31.440 | In fact, from a human-centric perspective
00:10:34.900 | of a human that values his life,
00:10:37.600 | the evolutionary process seems to be very brute force,
00:10:40.220 | very wasteful.
00:10:41.780 | So in that way, perhaps it does have similarities
00:10:43.920 | to the brute force search
00:10:45.140 | and the brute force learning self-play mechanisms
00:10:48.660 | that we see so successfully leveraging computation.
00:10:51.820 | So to summarize the argument made in the bitter lesson,
00:10:54.840 | the exponential progress of AI over the past 60, 70 years
00:10:59.140 | was coupled to the exponential progress of computation
00:11:03.020 | with Moore's law and the doubling of transistors.
00:11:05.540 | And as we stand today, the open question then is,
00:11:08.620 | if we look at the possibility
00:11:10.240 | of future exponential improvement
00:11:11.660 | of artificial intelligence,
00:11:13.420 | will that be due to human ingenuity?
00:11:15.400 | So invention of new, better, clever algorithms,
00:11:18.740 | or will it be due to improvement,
00:11:21.980 | increase in raw computation power?
00:11:24.880 | Or I think a distinct option is both.
00:11:28.180 | I'll talk about my bets for this open question
00:11:30.880 | at the end of the video,
00:11:32.080 | but at this time,
00:11:34.100 | let's talk about some possible flag bearers
00:11:38.020 | of exponential improvement in AI
00:11:40.580 | in the coming years and decades.
00:11:43.100 | First, let's look at Moore's law,
00:11:45.520 | which is an observation, it's not a law.
00:11:48.080 | It has two meanings, I would say.
00:11:49.680 | One is the precise technical meaning or the actual meaning,
00:11:53.520 | which is the doubling of transistor count every two years.
00:11:57.000 | Or you can look at it from a financial perspective
00:11:59.240 | and look at dollars per FLOP,
00:12:01.280 | which decreases exponentially.
00:12:02.640 | This allows you to compare CPUs and GPUs
00:12:04.920 | and different kinds of processors together on the same plot.
00:12:07.840 | And the second meaning,
00:12:09.000 | I think that's very commonly used
00:12:11.080 | in general public discourse,
00:12:13.000 | is the general sense that there's an exponential improvement
00:12:16.640 | of computational capabilities.
00:12:18.240 | And I'm actually personally okay with that use of Moore's law
00:12:21.480 | as we generalize across different technologies
00:12:24.080 | and different ideas to use Moore's law
00:12:26.640 | to mean the general observation
00:12:29.360 | of the exponential improvement
00:12:30.760 | of computational capabilities.
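As a rough illustration of those two readings, here is a minimal sketch; the starting transistor count and cost per GFLOP below are hypothetical, chosen only to show the doubling arithmetic, not figures from the video.

```python
# Illustrative sketch of both readings of Moore's law as a doubling process.
# Starting values are hypothetical, chosen only to show the arithmetic.

def doublings(years, period_years=2.0):
    """Number of doublings over a span, assuming a fixed doubling period."""
    return years / period_years

# Reading 1: transistor count doubles roughly every two years.
transistors_start = 40e6                         # hypothetical starting count
transistors_after_20y = transistors_start * 2 ** doublings(20)

# Reading 2: dollars per FLOP fall by half on the same cadence, which is what
# lets CPUs, GPUs, and other processors sit on the same cost curve.
dollars_per_gflop_start = 1000.0                 # hypothetical starting cost
dollars_per_gflop_after_20y = dollars_per_gflop_start / 2 ** doublings(20)

print(f"{transistors_after_20y:.3g} transistors, "
      f"${dollars_per_gflop_after_20y:.2f} per GFLOP")
# About 4.1e10 transistors and roughly $0.98/GFLOP after ten doublings.
```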
00:12:32.900 | So the question that's been asked many times
00:12:35.140 | over the past several decades is: is Moore's law dead?
00:12:38.120 | I think the answer has two camps.
00:12:40.440 | The majority of the industry says yes.
00:12:43.840 | And then there's a few folks like Jim Keller, now at Intel.
00:12:47.280 | I did a podcast with him, I highly recommend it.
00:12:50.720 | He says no, because actually when we look at the size
00:12:54.120 | of transistors, we have not yet hit
00:12:56.080 | the theoretical physics limit
00:12:57.760 | of how small we can get with the transistors.
00:13:00.880 | Now it gets extremely difficult for many reasons
00:13:04.620 | to get a transistor that starts approaching the size
00:13:07.160 | of a single nanometer in terms of power,
00:13:10.640 | in terms of error correction,
00:13:12.840 | in terms of what's required for actual fabrication
00:13:15.720 | of something at that kind of scale.
00:13:18.040 | But the theoretical physics limit hasn't been reached
00:13:20.680 | so Moore's law can continue.
00:13:22.340 | But also if we look at the broader definition
00:13:26.040 | of just exponential improvement
00:13:27.920 | of computational capabilities,
00:13:29.820 | there's a lot of other candidates, flag bearers,
00:13:32.480 | as I mentioned, that could carry
00:13:34.400 | that exponential flag forward.
00:13:36.440 | And let's look at them now.
00:13:37.840 | One is the global compute capacity.
00:13:41.540 | Now this one is really interesting.
00:13:43.120 | And I actually had trouble finding good data
00:13:45.780 | to answer the very basic question.
00:13:47.360 | I don't think that data exists.
00:13:48.960 | The question being how much total compute capacity
00:13:53.220 | is there in the world today?
00:13:55.240 | And looking historically, how has it been increasing?
00:13:58.280 | There's a few kind of speculative studies.
00:14:00.320 | Some of them I cite here.
00:14:01.460 | They're really interesting,
00:14:02.800 | but I do wish there was a little bit more data.
00:14:04.960 | I'm actually really excited by the potential of this
00:14:09.240 | in a way that is completely unexpected
00:14:12.000 | potentially in the future.
00:14:13.360 | Now, what are we talking about?
00:14:14.880 | We're talking about the actual number
00:14:16.900 | of general compute capable devices in the world.
00:14:20.760 | One of the really powerful compute devices
00:14:22.920 | that appeared over the past 20 years is gaming consoles.
00:14:27.800 | The other one, I mean, past maybe 10 years
00:14:30.480 | is a smartphone devices.
00:14:32.580 | Now, if we look into the future,
00:14:34.240 | the possibility, first of all,
00:14:35.680 | smartphone devices growing exponentially,
00:14:38.200 | but also the compute surface across all types of devices.
00:14:42.840 | So if we think of internet of things,
00:14:44.880 | every object in our day-to-day life
00:14:46.600 | gaining computational capabilities
00:14:48.860 | means that that computation can then be leveraged
00:14:51.280 | in some distributed way.
00:14:52.920 | And then we can look at an entirely other dimension
00:14:55.880 | of devices that could explode exponentially
00:14:58.240 | in the near or the long-term future
00:15:00.920 | of virtual reality and augmented reality devices.
00:15:04.520 | So currently both of those types of devices
00:15:06.400 | are not really gaining ground,
00:15:08.120 | but it's very possible that in the future,
00:15:13.000 | a huge amount of computational resources
00:15:14.880 | becomes available for these virtual worlds,
00:15:17.760 | for augmented worlds.
00:15:19.800 | So I'm actually really excited by the potential things
00:15:22.640 | that we can't yet expect
00:15:24.600 | in terms of the exponential growth of actual devices,
00:15:27.720 | which are able to do computation.
00:15:29.880 | The exponential expansion of compute surfaces in our world.
00:15:34.880 | That's really interesting.
00:15:36.280 | That might force us to rethink the nature of computation,
00:15:39.460 | to push it more and more
00:15:40.800 | towards like distributed computation.
00:15:44.200 | So speaking of distributed computation,
00:15:46.240 | another possibility of exponential growth of AI
00:15:49.280 | is just massively parallel computation.
00:15:52.220 | So increasing CPUs, GPUs,
00:15:54.680 | stacking them on top of each other
00:15:56.920 | and increasing that stack exponentially.
00:16:00.160 | Now you run up against Amdahl's law
00:16:03.040 | and all kinds of challenges that mean
00:16:06.440 | that as you increase the number of processors,
00:16:10.880 | it becomes more and more difficult:
00:16:12.840 | there's a diminishing return
00:16:15.440 | in terms of the compute speedup you gain
00:16:17.240 | when you add more processors.
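For reference, a minimal sketch of that diminishing return using the standard Amdahl's law formula; the parallel fraction and processor counts below are illustrative, not figures from the video.

```python
def amdahl_speedup(n_processors, parallel_fraction):
    """Amdahl's law: the serial part of a workload caps the speedup,
    no matter how many processors you add."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 95% of the work parallelizable, the speedup saturates near 20x.
for n in (10, 100, 1000, 1_000_000):
    print(f"{n} processors -> {amdahl_speedup(n, 0.95):.1f}x")
```

Perfectly parallelizable algorithms are the case where the serial fraction goes to zero, which is what would let the speedup keep growing with the processor count.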
00:16:18.440 | Now, if we can overcome Amdahl's law,
00:16:20.360 | if we can successfully design algorithms
00:16:23.580 | that are perfectly parallelizable across thousands,
00:16:27.500 | maybe millions, maybe billions of processors,
00:16:30.580 | then that changes the game.
00:16:32.580 | That changes the game
00:16:33.420 | and allows us to exponentially improve the AI algorithms
00:16:37.620 | by exponentially increasing
00:16:39.740 | the number of processors involved.
00:16:42.740 | Another dimension of approaches
00:16:44.580 | that contribute to exponential growth of AI
00:16:47.420 | is devices that are at their core parallelizable.
00:16:51.980 | More general devices like the GPUs,
00:16:54.860 | graphics processing units,
00:16:56.300 | or ones that are actually specific to neural networks
00:16:59.660 | or whatever the algorithm is,
00:17:00.980 | which are ASICs, application-specific integrated circuits.
00:17:04.620 | The TPU by Google being an excellent example of that
00:17:07.140 | where there's a bunch of hardware design decisions made
00:17:09.660 | that are specialized to machine learning,
00:17:11.180 | allowing it to be much more efficient
00:17:13.100 | in terms of both energy use
00:17:14.500 | and the actual performance of the algorithm.
00:17:18.300 | Now, another big space
00:17:19.820 | that I could probably divide into many slides
00:17:22.300 | of flag bearers for exponential AI growth
00:17:25.640 | is changing the actual nature of computation.
00:17:30.020 | So a completely different kind of computation.
00:17:33.020 | So two exciting candidates shown here,
00:17:35.020 | one is quantum computing
00:17:36.620 | and the other is neuromorphic computing.
00:17:38.900 | You're probably familiar with quantum computers
00:17:41.140 | and qubits: whereas classical computers
00:17:43.860 | only represent zeros and ones,
00:17:45.540 | qubits can also represent
00:17:47.660 | superpositions of zero and one.
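For reference, the standard textbook way to write that superposition (not from the video) is:

```latex
% A single-qubit state: a weighted superposition of the basis states |0> and |1>,
% with complex amplitudes whose squared magnitudes sum to one.
\[
  |\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle,
  \qquad |\alpha|^2 + |\beta|^2 = 1 .
\]
```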
00:17:50.020 | So there is a lot of excitement and development
00:17:52.740 | in this space, but it's, I would say, very early days,
00:17:56.780 | especially considering general methods
00:18:00.300 | that are able to leverage computation.
00:18:02.140 | First, it's really hard to build large quantum computers,
00:18:06.980 | but even if you can, second,
00:18:09.020 | it's very hard to build algorithms
00:18:10.820 | that significantly outperform the algorithms
00:18:13.500 | on classical computers,
00:18:15.100 | especially in the space of artificial intelligence
00:18:17.380 | with machine learning.
00:18:19.060 | Then there's another space of computing
00:18:21.000 | called neuromorphic computing
00:18:22.700 | that draws a lot of inspiration,
00:18:24.620 | a lot more inspiration from the human brain.
00:18:28.480 | Specifically, it models spiking neural networks.
00:18:31.540 | Now, the idea, I think, with neuromorphic computing
00:18:34.500 | is it's able to perform computation
00:18:36.620 | in a much more efficient way.
00:18:38.300 | One of the characteristic things about the human brain
00:18:41.340 | versus our computers today
00:18:42.700 | is that it's much more energy efficient
00:18:45.820 | for the same amount of computation.
00:18:48.720 | So neuromorphic computing is trying to achieve
00:18:51.460 | the same kind of performance.
00:18:53.180 | Again, very early days,
00:18:55.340 | and it's unclear how you can design general algorithms
00:18:59.300 | that reach even close to the same performance
00:19:02.420 | of machine learning algorithms, for example,
00:19:04.940 | run on classical computers of today with GPUs or ASICs.
00:19:08.620 | But of course, if you want to have a complete shift,
00:19:13.140 | like a phase shift in terms of the way
00:19:15.240 | we approach computation and artificial intelligence,
00:19:17.880 | a computer which functions in a completely different way
00:19:21.960 | than our classical computers is something
00:19:24.540 | that might be able to achieve that kind of phase shift.
00:19:27.180 | Now, another really exciting space of methodologies
00:19:30.220 | is brain-computer interfaces.
00:19:32.000 | In the short term, it's exciting
00:19:33.580 | because it may help us understand
00:19:36.460 | and treat neurological diseases.
00:19:38.740 | But in the long term,
00:19:40.520 | the possibility of leveraging human brains
00:19:44.780 | for computation, now that's a weird way to put it,
00:19:48.140 | but we have a lot of compute power in our brains.
00:19:50.900 | We're actually doing a lot of computation, each one of us,
00:19:54.540 | every living moment of our lives.
00:19:57.360 | And the unfortunate thing is we're not able
00:20:00.340 | to share the outcome of that computation with the world.
00:20:04.940 | We share it with a very low bandwidth channel.
00:20:08.020 | So not from an individual perspective,
00:20:11.380 | but from a perspective of society,
00:20:13.020 | it's interesting to consider if we can create
00:20:15.420 | a high bandwidth connection between a computer
00:20:17.220 | and a human brain, then we're able to leverage
00:20:21.100 | the computation the human brain already provides
00:20:24.000 | to be able to add to the global compute capacity
00:20:28.860 | available to the world.
00:20:30.820 | That's a really interesting possibility.
00:20:33.080 | The way I put it is a little bit ineloquent,
00:20:36.140 | but I think oftentimes when you talk
00:20:38.380 | about brain-computer interfaces, the way, for example,
00:20:41.100 | Elon Musk talks about Neuralink, it's often talked about
00:20:44.420 | from an individual perspective of increasing your ability
00:20:48.220 | to communicate with the world and receive information
00:20:50.260 | from the world, but if you look from a society perspective,
00:20:53.460 | you're now able to leverage the computational power
00:20:56.740 | of human brains, either the empty cycles
00:20:59.420 | or just the actual computation we'll perform
00:21:01.620 | to survive in our daily lives, able to leverage that
00:21:05.460 | to add to the global compute surface,
00:21:09.660 | the global capacity available in the world.
00:21:12.100 | And the human brain is quite an incredible
00:21:14.780 | computing machine, so if you can connect into that
00:21:18.340 | and share that computation, I think incredible
00:21:23.340 | exponential growth can be achieved
00:21:26.420 | without significant innovation on the algorithm side.
00:21:29.380 | Now, a lot of the previous things we talked about
00:21:31.400 | were more on the hardware side, or at least
00:21:33.860 | very low-level software side of exponential improvement.
00:21:37.300 | I really like the recent paper from Danny Hernandez
00:21:41.180 | and others at OpenAI called "Measuring the Algorithmic
00:21:43.820 | Efficiency of Neural Networks" that looks at different
00:21:46.300 | kind of domains of machine learning and deep learning
00:21:48.800 | and shows that the efficiency of the algorithms involved
00:21:51.620 | has increased exponentially, actually far outpacing
00:21:54.540 | the improvement of Moore's law.
00:21:55.620 | So if we look at sort of the main one,
00:21:58.540 | starting from the ImageNet moment with AlexNet,
00:22:01.500 | a neural network on a computer vision task:
00:22:03.140 | if we look at AlexNet in 2012 and then EfficientNet in 2019
00:22:08.140 | and all the networks that led up to it,
00:22:11.140 | the improvement is that it takes 44 times less computation
00:22:16.140 | to train a neural network to the level of AlexNet.
00:22:19.020 | So if we look at Moore's law over the same kind of
00:22:22.340 | span of time, Moore's law would only account for
00:22:25.700 | an 11 times decrease in the cost.
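A quick back-of-the-envelope check of those two figures, assuming 2012 to 2019 is roughly a seven-year span:

```python
import math

years = 7                                     # roughly 2012 (AlexNet) to 2019

# Moore's law reading: one doubling every two years over that span.
moore_factor = 2 ** (years / 2)
print(f"Moore's law over {years} years: ~{moore_factor:.1f}x")      # ~11.3x

# Reported algorithmic-efficiency gain: 44x less training compute.
efficiency_factor = 44
doubling_months = 12 * years / math.log2(efficiency_factor)
print(f"Efficiency doubling time: ~{doubling_months:.0f} months")   # ~15-16 months
```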
00:22:29.540 | And the paper also highlights the same kind of
00:22:32.180 | exponential improvements in natural language,
00:22:34.460 | even in reinforcement learning.
00:22:36.220 | So the open question this raises is whether, with deep learning,
00:22:40.140 | when we look at these learning methods,
00:22:42.060 | algorithmic progress may yield more gains
00:22:45.500 | than hardware efficiency improvements.
00:22:48.140 | That's a really exciting possibility,
00:22:49.620 | especially for people working in the field,
00:22:51.500 | because that means human ingenuity will be essential
00:22:55.040 | for the continued exponential improvement of AI.
00:22:58.700 | All that said, whether AI will continue to improve
00:23:02.060 | exponentially is an open question.
00:23:04.220 | I wanna sort of place a flag down.
00:23:06.460 | I don't know, I change my mind every day on most things,
00:23:08.540 | but today I feel AI will continue to improve exponentially.
00:23:12.340 | Now, exponential improvement is always
00:23:14.340 | just a stack of S-curves.
00:23:15.820 | It's not a single sort of nice exponential improvement.
00:23:19.140 | It's always kind of big breakthrough innovations
00:23:21.700 | stacked on top of each other that level out,
00:23:24.900 | and then a new innovation comes along.
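A small sketch of that picture, with hypothetical logistic S-curves whose ceilings grow so that their sum traces a roughly exponential envelope:

```python
import math

def s_curve(t, midpoint, ceiling):
    """One logistic S-curve: slow start, rapid rise, then leveling out."""
    return ceiling / (1.0 + math.exp(-(t - midpoint)))

def stacked(t, curves=((5, 10), (15, 100), (25, 1000))):
    # Each hypothetical innovation levels out, and the next one, with a higher
    # ceiling, takes over; the running sum looks roughly exponential.
    return sum(s_curve(t, midpoint, ceiling) for midpoint, ceiling in curves)

for t in range(0, 31, 5):
    print(f"t={t:2d}  capability ~ {stacked(t):8.1f}")
```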
00:23:27.980 | So the other open question is,
00:23:30.180 | where will the S-curves that feed the exponential
00:23:32.420 | come from most likely,
00:23:33.620 | out of the candidates that we discussed?
00:23:35.700 | So for me, it's the innovation in algorithms
00:23:38.340 | and the innovation in supervised learning,
00:23:40.540 | in how data is organized and leveraged
00:23:44.780 | in that learning process;
00:23:45.860 | so, the efficiency of learning and search processes,
00:23:48.820 | especially with active learning.
00:23:50.740 | You know, there's a lot of terminology swimming around
00:23:52.860 | that's a little bit loose.
00:23:54.180 | So folks like Yann LeCun are really excited
00:23:56.780 | by self-supervised learning.
00:23:58.740 | And you can think of it,
00:23:59.980 | you can define it however the heck you want,
00:24:01.620 | but you can think of self-supervised learning
00:24:03.940 | as leveraging human annotation very little,
00:24:07.140 | leveraging human expertise very little.
00:24:09.980 | So that's looking at mechanisms that are extremely powerful
00:24:12.500 | like self-play and reinforcement learning,
00:24:14.860 | or in a video computer vision context,
00:24:17.220 | you have the idea would be that you would have an algorithm
00:24:20.740 | just watches YouTube videos all day.
00:24:22.820 | And from that is able to figure out
00:24:24.620 | the common sense reasoning,
00:24:26.100 | the physics of the world and so on
00:24:28.100 | in an unsupervised way, just by observing the world.
00:24:31.580 | Now, for me, I'm excited by active learning much more,
00:24:34.740 | which is the optimization of the way you select the data
00:24:38.420 | from which you learn from.
00:24:39.620 | that you learn from.
00:24:40.740 | I'm going to become increasingly efficient.
00:24:42.700 | I'm going to learn from smaller and smaller data sets,
00:24:44.820 | but I'm going to be extremely selective
00:24:46.980 | about which part of the data I look at
00:24:49.300 | and annotate or ask human supervision over.
00:24:53.260 | I think a really simple, but exciting example
00:24:55.900 | of that in the real world is what the Tesla Autopilot team
00:24:58.220 | is doing by creating this pipeline
00:25:00.660 | where there's a multitask learning framework,
00:25:04.180 | where there's a bunch of different tasks.
00:25:06.140 | And there's a pipeline for discovering edge cases
00:25:09.100 | for each of the tasks.
00:25:10.060 | And you keep feeding the discovered edge cases back
00:25:15.460 | and retraining the network over and over
00:25:17.060 | for each of the different tasks.
00:25:18.620 | And then there's a shared part of the network
00:25:21.020 | that keeps learning over time.
00:25:22.180 | And so there's this active learning framework
00:25:24.900 | that just keeps looping over and over
00:25:26.260 | and gets better and better over time
00:25:27.660 | as it continually discovers and learns from the edge cases.
00:25:32.660 | I think that's a very simple example
00:25:34.300 | of what I'm talking about,
00:25:35.140 | but I'm really excited by that possibility.
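A minimal sketch of that kind of loop, under assumed placeholder pieces; the score, annotate, and retrain callables are hypothetical stand-ins, and this is not Tesla's actual pipeline.

```python
# Hypothetical active-learning loop: repeatedly mine the unlabeled pool for
# likely edge cases, ask for labels on just those, and retrain a shared model.

def active_learning_loop(model, unlabeled, score, annotate, retrain,
                         rounds=5, budget_per_round=100):
    labeled = []
    for _ in range(rounds):
        # Rank remaining examples by how uncertain or surprising they are to
        # the current model; keep only the top "edge cases" this round.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        edge_cases, unlabeled = ranked[:budget_per_round], ranked[budget_per_round:]
        # Ask for human annotation only on the selected edge cases.
        labeled.extend((x, annotate(x)) for x in edge_cases)
        # Retrain (or fine-tune) the shared model on everything labeled so far.
        model = retrain(model, labeled)
    return model
```

Here score, annotate, and retrain stand in for whatever uncertainty measure, labeling step, and training procedure a real pipeline would use; the point is only that annotation effort is spent where the model is weakest.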
00:25:37.500 | So, innovation in learning, in terms of its ability
00:25:40.060 | to discover just the right data to improve its performance.
00:25:44.900 | And I believe the performance of active learning
00:25:46.860 | can increase exponentially in the coming years.
00:25:49.620 | Another source of S-curves that I'm really excited about,
00:25:53.460 | but it's very unpredictable,
00:25:55.500 | is the general expansion of the compute surfaces
00:25:58.340 | in the world.
00:25:59.460 | So it's unclear, but it's very possible
00:26:03.820 | that the Internet of Things, IoT,
00:26:07.540 | eventually will come around
00:26:09.460 | where there's smart devices just everywhere.
00:26:12.140 | And we're not talking about Alexa here or there.
00:26:14.480 | We're talking about just everything as a compute surface.
00:26:19.340 | I think that's a really exciting possibility of the future.
00:26:21.980 | It may be far away,
00:26:23.300 | but I certainly hope to be part of the people
00:26:25.460 | that try to create some of that future.
00:26:29.040 | So it's an exciting out there possibility.
00:26:32.100 | To me, the total game changer that we don't expect
00:26:35.140 | that seems crazy,
00:26:37.020 | especially when Elon Musk talks about it
00:26:38.820 | in the context of Neuralink, is brain-computer interfaces.
00:26:41.960 | I think that's a really exciting technology
00:26:44.540 | for helping understand and treat neurological diseases.
00:26:49.020 | But if you can make it work
00:26:50.940 | to where a computer can communicate
00:26:52.580 | in a high bandwidth way with a brain,
00:26:54.940 | a two-way communication,
00:26:56.780 | that's going to change the nature of computation
00:26:59.480 | and the nature of artificial intelligence completely.
00:27:02.380 | If an AI system can communicate with the human brain
00:27:05.180 | with each leveraging the other's computation,
00:27:07.860 | I don't think we can even imagine
00:27:11.420 | the kind of world that that would create.
00:27:13.820 | That's a really exciting possibility,
00:27:16.760 | but at this time, it's shrouded in uncertainty.
00:27:21.160 | It seems impossible and crazy,
00:27:23.840 | but if anyone can do it,
00:27:24.920 | it's the folks working on brain-computer interfaces
00:27:27.040 | and certainly folks like Elon Musk
00:27:28.840 | and the brilliant engineers working at Neuralink.
00:27:32.920 | When you talk about exponential improvement in AI,
00:27:35.980 | the natural question that people ask is,
00:27:37.880 | "When is the singularity coming?
00:27:39.480 | "Is it 2030, 2045, 2050, a century from now?"
00:27:44.960 | Again, I don't have firm beliefs on this,
00:27:48.320 | but from my perspective,
00:27:49.920 | I think we're living through the singularity.
00:27:52.520 | I think the smoothness of the exponential improvement
00:27:55.960 | that we've been a part of in artificial intelligence
00:27:59.600 | is sufficiently smooth to where we don't even sense
00:28:02.400 | the madness of the curvature of the improvement
00:28:05.840 | that we've been living through.
00:28:06.800 | I think it's been just incredible,
00:28:08.860 | and every new stage we just so quickly take for granted.
00:28:13.400 | I think we're living through the singularity,
00:28:15.120 | and I think we'll continue adapting incredibly well
00:28:19.360 | to the exponential improvement of AI.
00:28:22.040 | I can't wait to see what the future holds.
00:28:24.320 | So that was my simple attempt to discuss some of the ideas
00:28:27.240 | by Rich Sutton in his blog post, "The Bitter Lesson,"
00:28:30.720 | and the broader context of exponential improvement in AI
00:28:33.720 | and the role of computation and algorithmic improvement
00:28:36.240 | in that exponential improvement.
00:28:38.440 | This has been part of the AI Paper Club
00:28:41.520 | on our Discord server.
00:28:43.000 | You're welcome to join anytime.
00:28:44.400 | It's not just AI.
00:28:45.320 | It's people from all walks of life, all levels of expertise,
00:28:49.240 | from artists to musicians to neuroscientists to physicists.
00:28:53.680 | It's kind of an incredible community,
00:28:55.040 | and I really enjoyed being part of it.
00:28:56.800 | It's, I think, something special, so join us anytime.
00:29:01.440 | If you have suggestions for papers we should cover,
00:29:03.520 | let me know.
00:29:04.680 | Otherwise, thanks for watching, and I'll see you next time.