Moore's Law is Not Dead (Jim Keller) | AI Podcast Clips
Chapters
0:20 What Is Moore's Law
1:25 Broader Definition of Moore's Law
14:52 Building Blocks of Mathematics
19:40 Nanowires
For over 50 years now, Moore's Law has served for me and millions of others as an inspiring 00:00:08.760 |
beacon of what kind of amazing future brilliant engineers can build. 00:00:14.120 |
I'm just making your kids laugh all of today. 00:00:18.420 |
So first, in your eyes, what is Moore's Law, if you could define it for people who don't know? 00:00:25.960 |
Well, the simple statement, from Gordon Moore, was double the number of transistors every two years. 00:00:34.200 |
And then my operational model is, we increase the performance of computers by 2x every two years. 00:00:43.400 |
And it's wiggled around substantially over time. 00:00:46.340 |
And also, in how we deliver performance has changed. 00:00:51.320 |
But the foundational idea was 2x the transistors every two years. 00:00:57.680 |
The current cadence is something like, they call it a shrink factor, like .6 every two years. 00:01:06.720 |
But that's referring strictly, again, to the original definition of just-- 00:01:11.320 |
A shrink factor, just getting them smaller and smaller and smaller. 00:01:16.540 |
If you make the transistors smaller by .6, then you get one over .6 more transistors. 00:01:24.700 |
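A quick editor's sketch of that arithmetic (reading the .6 as an area shrink, as the quote implies; the numbers are illustrative):

```python
# Shrink-factor arithmetic: a 0.6 area shrink per node means each
# transistor takes 0.6x the area, so density rises by 1/0.6.
shrink = 0.6
per_node = 1 / shrink                 # ~1.67x more transistors per node
per_decade = per_node ** 5            # five two-year nodes in a decade
print(f"{per_node:.2f}x per node, {per_decade:.1f}x per decade")  # ~1.67x, ~12.9x
```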
What's a broader-- what do you think should be the broader definition of Moore's Law? 00:01:28.340 |
When you mentioned how you think of performance, just broadly, what's a good way to think about it? 00:01:36.420 |
Well, first of all, I've been aware of Moore's Law for 30 years. 00:01:47.780 |
You're just watching it before your eyes, kind of thing. 00:01:50.620 |
Well, somewhere around when I became aware of it, I was also informed that Moore's Law was going to die in 10 to 15 years. 00:01:58.820 |
But then after 10 years, it was going to die in 10 to 15 years. 00:02:02.380 |
And then at one point, it was going to die in five years. 00:02:06.180 |
And at some point, I decided not to worry about that particular prognostication for the rest of my life. 00:02:14.500 |
And then I joined Intel, and everybody said Moore's Law is dead. 00:02:17.660 |
And I thought, that's sad, because it's the Moore's Law company. 00:02:24.260 |
And humans like these apocalyptic kind of statements, like, we'll run out of food, or we'll run 00:02:30.340 |
out of air, or run out of room, or run out of something. 00:02:35.100 |
But it's still incredible that it's lived for as long as it has. 00:02:39.540 |
And yes, there's many people who believe now that Moore's Law is dead. 00:02:44.860 |
You know, they can join the last 50 years of people who had the same idea. 00:02:50.340 |
But why do you think, if you can try to understand it, why do you think it's not dead currently? 00:02:58.100 |
Well, first, let's just think, people think Moore's Law is one thing. 00:03:04.020 |
But actually, under the sheet, there's literally thousands of innovations. 00:03:07.360 |
And almost all those innovations have their own diminishing return curves. 00:03:12.180 |
So if you graph it, it looks like a cascade of diminishing return curves. 00:03:25.660 |
So if you're an expert in one of the things on a diminishing return curve, right, and 00:03:30.860 |
you can see its plateau, you will probably tell people, well, this is done. 00:03:36.540 |
Meanwhile, some other pile of people are doing something different. 00:03:43.100 |
So then there's the observation of how small could a switching device be? 00:03:48.900 |
So a modern transistor is something like a thousand by a thousand by a thousand atoms. 00:03:54.860 |
And you get quantum effects down around two to ten atoms. 00:03:59.500 |
So you can imagine a transistor as small as 10 by 10 by 10. 00:04:06.940 |
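Putting numbers on that (editor's arithmetic): a 1000 by 1000 by 1000-atom device is about 10^9 atoms, while a 10 by 10 by 10 one is 10^3, so the switching volume could in principle shrink by a factor of 10^9 / 10^3 = 10^6, a million.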
And then the quantum computational people are working away at how to use quantum effects. 00:04:12.300 |
So a thousand by a thousand by a thousand atoms. 00:04:21.500 |
Well, a fin, like a modern transistor, if you look at the fin, it's like 120 atoms wide. 00:04:33.740 |
And a competent transistor designer could count the atoms in every single direction. 00:04:42.820 |
Like there's techniques now to already put down atoms in a single atomic layer. 00:04:50.720 |
It's just from a manufacturing process, placing an atom takes 10 minutes. 00:04:56.140 |
And you need to put 10 to the 23rd atoms together to make a computer. 00:05:03.660 |
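A back-of-the-envelope check on why atom-at-a-time manufacturing can't work (editor's sketch; the 10-minute and 10^23 figures come from the conversation above):

```python
# One atom every 10 minutes vs. the ~1e23 atoms in a computer.
atoms = 1e23
total_minutes = atoms * 10                  # placement time in minutes
years = total_minutes / (60 * 24 * 365)     # convert minutes to years
print(f"{years:.1e} years")                 # ~1.9e18 years of placement
```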
So the methods are both shrinking things and then coming up with effective ways to control what's happening. 00:05:19.380 |
There's equipment, there's optics, there's chemistry, there's physics, there's material science. 00:05:25.940 |
There's lots of ideas about when you put different materials together, how do they interact? 00:05:36.300 |
There's literally thousands of technologies involved. 00:05:39.820 |
- But just for the shrinking, you don't think we're quite yet close to the fundamental limits of physics? 00:05:45.380 |
- I did a talk on Moore's Law and I asked for a roadmap to a path of 100x. 00:05:49.700 |
And after two weeks, they said we only got to 50. 00:05:57.260 |
- They got to 50, and I said, "Why don't you give it another two weeks?" 00:06:04.460 |
So I believe that the next 10 or 20 years of shrinking is gonna happen. 00:06:11.180 |
Now as a computer designer, you have two stances. 00:06:15.780 |
You think it's going to shrink, in which case you're designing and thinking about architecture in a way that you'll use more transistors. 00:06:23.820 |
Or conversely, not be swamped by the complexity of all the transistors you get. 00:06:34.620 |
- You're open to the possibility and waiting for the possibility of a whole new army of transistors? 00:06:40.660 |
- I'm expecting more transistors every two or three years by a number large enough that 00:06:47.700 |
how you think about design, how you think about architecture has to change. 00:06:52.060 |
Like imagine you build buildings out of bricks, and every year the bricks are half the size. 00:07:00.700 |
Well if you kept building with bricks the same way, so many bricks per person per day, the 00:07:06.220 |
amount of time to build a building would go up exponentially. 00:07:09.660 |
But if you said, "I know that's coming, so now I'm gonna design equipment that moves 00:07:16.140 |
bricks faster, uses them better," because maybe you're getting something out of the 00:07:19.420 |
smaller bricks, more strength, thinner walls, less material, efficiency out of that. 00:07:25.180 |
So once you have a roadmap with what's gonna happen, transistors, we're gonna get more 00:07:30.140 |
of them, then you design all this collateral around it to take advantage of it and also to count on it. 00:07:38.940 |
If I didn't believe in Moore's law and then Moore's law transistors showed up, my design would be wrong. 00:07:46.540 |
- So what's the hardest part of this influx of new transistors? 00:07:52.060 |
I mean, even if you just look historically throughout your career, what's the thing, 00:07:59.060 |
what fundamentally changes when you add more transistors in the task of designing an architecture? 00:08:10.860 |
- By the way, there's some science showing that we do get smarter because of nutrition. 00:08:25.740 |
- I would believe for the most part people aren't getting much smarter. 00:08:35.700 |
So human beings, we're really good in teams of 10, up to teams of 100; they can know each other. 00:08:43.460 |
Beyond that, you have to have organizational boundaries. 00:08:45.620 |
So those are pretty hard constraints, right? 00:08:51.260 |
Like as the designs get bigger, you have to divide it into pieces. 00:08:55.020 |
You know, the power of abstraction layers is really high. 00:08:57.980 |
We used to build computers out of transistors. 00:09:00.900 |
Now we have a team that turns transistors into logic cells and another team that turns 00:09:04.340 |
them into functional units and another one that turns them into computers, right? 00:09:11.100 |
And you have to think about when do you shift gears on that? 00:09:16.100 |
We also use faster computers to build faster computers. 00:09:19.060 |
So some algorithms run twice as fast on new computers, but a lot of algorithms are N squared. 00:09:25.220 |
So designing a computer with twice as many transistors in it might take four times as long. 00:09:33.540 |
Like simply using faster computers to build bigger computers doesn't work. 00:09:41.020 |
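A minimal sketch of that scaling argument (editor's illustration, not from the conversation):

```python
# If a design tool's runtime grows as N^2 in transistor count N,
# a chip with 2x the transistors needs 4x the work, so even a
# machine that is itself 2x faster finishes 2x slower overall.
def relative_runtime(transistors: float, machine_speed: float) -> float:
    return transistors ** 2 / machine_speed

this_gen = relative_runtime(1.0, 1.0)
next_gen = relative_runtime(2.0, 2.0)   # 2x transistors on a 2x machine
print(next_gen / this_gen)              # 2.0 -- the tools fall behind
```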
So in terms of computing performance, and the exciting possibilities that more powerful computers 00:09:45.300 |
bring: is shrinking, the thing we've just been talking about, 00:09:51.420 |
for you one of the biggest exciting possibilities of advancement in performance, 00:09:56.260 |
or are there other directions that you're interested in? 00:09:59.900 |
Like in the direction of sort of enforcing greater parallelism, or like doing massive parallelism 00:10:06.780 |
in terms of many, many CPUs, stacking CPUs on top of each other, that kind of parallelism? 00:10:17.020 |
So old computers, slow computers, you said A equals B plus C times D. Pretty simple, right? 00:10:25.420 |
And then we made faster computers with vector units, and you can do proper equations and matrices, right? 00:10:33.340 |
And then modern, like, AI computations, or like convolutional neural networks, where you convolve one large data set against another. 00:10:41.980 |
And so there's sort of this hierarchy of mathematics, from simple equation to linear equations, 00:10:48.820 |
to matrix equations, to deeper kind of computation. 00:10:53.620 |
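That hierarchy maps onto familiar code; here is a hedged editor's sketch in NumPy (the arrays and sizes are illustrative, not from the conversation):

```python
import numpy as np

# Scalar era: "A equals B plus C times D".
b, c, d = 2.0, 3.0, 4.0
a = b + c * d

# Vector era: the same expression across whole arrays (vector units).
B, C, D = np.arange(4.0), np.arange(4.0), np.arange(4.0)
A = B + C * D

# Matrix era: linear algebra (matrix engines).
M = np.ones((4, 4))
A2 = M @ A

# Convolution era: slide a small weight kernel across a large array.
kernel = np.array([1.0, 0.0, -1.0])
out = np.convolve(np.arange(16.0), kernel, mode="valid")
```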
And the datasets are getting so big that people are thinking of data as a topology problem. 00:11:02.860 |
And then the computation sort of wants to get data from that immense shape and do some computation on it. 00:11:10.140 |
So what computers have allowed people to do is have algorithms go much, much further. 00:11:17.320 |
So that paper you referenced, the Sutton paper, they talked about, like when AI started, it was rule sets. 00:11:26.740 |
That's a very simple computational situation. 00:11:30.660 |
And then when they did the first chess thing, they used deep search. 00:11:34.740 |
So you have a huge database of moves and results, deep search, but it's still just a search. 00:11:42.980 |
Now we take large numbers of images and we use them to train these weight sets that we 00:11:49.340 |
convolve across. It's a completely different kind of phenomenon. 00:11:57.260 |
And if you look at it, they're going up this mathematical graph, right? 00:12:02.420 |
And both computation and datasets support going up that graph. 00:12:08.300 |
- Yeah, the kind of computation that might, I mean, I would argue that all of it is still search. 00:12:14.900 |
Just like you said, a topology problem of datasets: you're searching the datasets for patterns. 00:12:21.940 |
And also the actual optimization of neural networks is a kind of search for the-- 00:12:27.180 |
- I don't know, if you had looked at the inner layers of finding a cat, it's not a search. 00:12:35.860 |
So projection, here's a shadow of this phone, right? 00:12:40.500 |
And then you can have a shadow of that on something, and a shadow of that on something else. 00:12:44.100 |
If you look in the layers, you'll see this layer actually describes pointy ears and round 00:12:48.820 |
eyedness and fuzziness, but the computation to tease out the attributes is not search. 00:12:58.940 |
- Like the inference part might be search, but the training is not search. 00:13:02.740 |
And then in deep networks, they look at layers and they don't even know what's represented. 00:13:09.260 |
And yet if you take the layers out, it doesn't work. 00:13:14.380 |
- But you have to talk to a mathematician about what that actually is. 00:13:17.140 |
- Well, we could disagree, but it's just semantics, I think. 00:13:23.860 |
- I would say it's absolutely not semantics, but-- 00:13:26.740 |
- Okay, all right, well, if you wanna go there. 00:13:31.860 |
So optimization to me is search, and we're trying to optimize the ability of a neural network to accomplish a task. 00:13:40.980 |
And the difference between chess and the space, the incredibly multi-dimensional, 100,000 00:13:50.620 |
dimensional space that neural networks are trying to optimize over is nothing like the search space of chess. 00:14:06.020 |
The funny thing is, it's the difference between given search space and found search space. 00:14:12.180 |
- Yeah, maybe that's a different way to describe it. 00:14:15.180 |
- But you're saying, what's your sense in terms of the basic mathematical operations 00:14:19.500 |
and the architectures, hardware that enables those operations? 00:14:24.700 |
Do you see the CPUs of today still being a really core part of executing those mathematical operations? 00:14:33.460 |
Well, the operations continue to be add, subtract, load, store, compare, and branch. 00:14:40.960 |
So it's interesting that the building blocks of computers are transistors, and under that, atoms. 00:14:47.820 |
So you got atoms, transistors, logic gates, computers, functional units of computers. 00:14:53.180 |
The building blocks of mathematics at some level are things like adds and subtracts and 00:14:58.420 |
multiplies, but the space mathematics can describe is, I think, essentially infinite. 00:15:06.180 |
But the computers that run the algorithms are still doing the same things. 00:15:11.140 |
Now, a given algorithm might say, "I need sparse data," or, "I need 32-bit data," or, 00:15:16.980 |
"I need a convolution operation that naturally takes 8-bit data, multiplies it, and sums 00:15:26.500 |
So the data types in TensorFlow imply an optimization set, but when you go right down and look at 00:15:34.380 |
the computers, it's AND and OR gates doing adds and multiplies. 00:15:40.580 |
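An editor's illustration of that 8-bit-multiply, wide-accumulate pattern (a common quantized-inference idiom; the sizes and dtypes here are assumptions, not from the conversation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64).astype(np.int8)  # 8-bit activations
w = rng.integers(-128, 128, size=64).astype(np.int8)  # 8-bit weights

# Multiply narrow 8-bit values but accumulate into 32 bits so the
# running dot product cannot overflow.
acc = int(np.sum(x.astype(np.int32) * w.astype(np.int32)))
print(acc)
```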
Now, the quantum researchers think they're going to change that radically, and then there's 00:15:45.260 |
people who think about analog computing, because you look in the brain and it seems to be more 00:15:48.660 |
analog-ish, that maybe there's a way to do that more efficiently. 00:15:54.060 |
We have a million X on computation, and I don't know the relationship between computational, 00:16:04.140 |
let's say, intensity and ability to hit mathematical abstractions. 00:16:09.900 |
I don't know any ways to describe that, but just like you saw in AI, you went from rule 00:16:16.100 |
sets to simple search to complex search to, say, found search. 00:16:21.660 |
Those are orders of magnitude more computation to do. 00:16:26.380 |
And as we get the next two orders of magnitude, like a friend, Raja Koduri, said, "Every order of magnitude changes the computation." 00:16:35.540 |
- Fundamentally changes what the computation is doing. 00:16:39.140 |
Oh, you know the expression, "The difference in quantity is the difference in kind." 00:16:44.300 |
You know, the difference between ant and anthill, right? 00:16:50.780 |
There's this indefinable place where the quantity changed the quality, right? 00:16:57.300 |
And we've seen that happen in mathematics multiple times, and my guess is it's gonna keep happening. 00:17:03.380 |
- So, in your sense, if you focus head-down on shrinking the transistor... 00:17:10.540 |
- We're aware of the software stacks that are running the computational loads, and we're 00:17:15.980 |
kind of pondering what do you do with a petabyte of memory that wants to be accessed in a sparse 00:17:20.740 |
way and have the kind of calculations AI programmers want. 00:17:27.540 |
So there's a dialogue, an interaction, but when you go in the computer chip, you find adders and subtractors and multipliers. 00:17:37.980 |
- So if you zoom out then with, as you mentioned, Rich Sutton, the idea that most of the development 00:17:44.020 |
in the last many decades in AI research came from just leveraging computation and just 00:17:50.060 |
simple algorithms waiting for the computation to improve. 00:17:54.740 |
- Well, software guys have a thing that they call the problem of early optimization. 00:18:01.900 |
So you write a big software stack, and if you start optimizing the first thing you write, 00:18:07.180 |
the odds of that being the performance limiter is low. 00:18:09.740 |
But when you get the whole thing working, can you make it 2x faster by optimizing the right things? 00:18:15.820 |
While you're optimizing that, could you have written a new software stack, which would have been a better choice? 00:18:25.060 |
- But the whole time as you're doing the writing, that's the software we're talking about. 00:18:29.700 |
The hardware underneath gets faster and faster. 00:18:32.860 |
If Moore's Law is going to continue, then your AI research should expect that to show up. 00:18:41.020 |
And then you make a slightly different set of choices. 00:18:46.180 |
And from here, it's just us rewriting algorithms. 00:18:50.020 |
That seems like a failed strategy for the last 30 years of Moore's Law's death. 00:18:57.900 |
I think you've answered it, but I'll just ask the same dumb question over and over. 00:19:01.780 |
So why do you think Moore's Law is not going to die? 00:19:07.580 |
Which is the most promising, exciting possibility of why it won't die in the next five, 10 years? 00:19:12.780 |
So is it the continued shrinking of the transistor, or is it another S-curve that steps in and takes over? 00:19:20.180 |
- Well, shrinking the transistor is literally thousands of innovations. 00:19:28.140 |
There's a whole bunch of S-curves just kind of running their course and being reinvented 00:19:36.500 |
The semiconductor fabricators and technologists have all announced what's called nanowires. 00:19:42.220 |
So they took a fin which had a gate around it and turned that into little wires so you 00:19:47.620 |
have better control of that and they're smaller. 00:19:50.220 |
And then from there, there are some obvious steps about how to shrink that. 00:19:54.460 |
The metallurgy around wire stacks and stuff has very obvious abilities to shrink. 00:20:02.220 |
And there's a whole combination of things there to do. 00:20:05.820 |
- Your sense is that we're going to get a lot of this innovation from just that shrinking? 00:20:10.340 |
- Yeah, like a factor of a hundred, it's a lot. 00:20:19.940 |
- Now you're smart and you might know, but to me it's totally unpredictable what that 00:20:23.460 |
hundred X would bring in terms of the nature of the computation that people would be doing. 00:20:32.100 |
So for a long time it was mainframes, minis, workstation, PC, mobile. 00:20:41.060 |
And then when we were thinking about Moore's law, Raja Koduri said every 10x generates a new computation. 00:20:48.100 |
So scalar, vector, matrix, topological computation. 00:20:56.060 |
And if you go look at the industry trends, there was mainframes and then mini computers 00:21:02.220 |
And then the internet took off, and then we got mobile devices, and now we're building 5G wireless with one-millisecond latency. 00:21:09.940 |
And people are starting to think about the smart world where everything knows you, recognizes 00:21:15.380 |
you, like the transformations are going to be like unpredictable. 00:21:22.280 |
- How does it make you feel that you're one of the key architects of this kind of future? 00:21:29.940 |
So we're not talking about the architects of the high level, people who build the Angry Birds apps-- 00:21:39.900 |
Maybe that's the whole point of the universe. 00:21:41.620 |
- I'm going to take a stand on that, and on the attention-distracting nature of mobile phones. 00:21:53.180 |
- The side effects of smartphones or the attention distraction, which part? 00:22:02.980 |
- My parents used to yell at my sisters for hiding in the closet with a wired phone with a dial on it. 00:22:10.620 |
- And now my wife yells at my kids for talking to their friends all day on text. 00:22:18.500 |
But you are one of the key people architecting the hardware of this future. 00:22:30.900 |
- So we're in a social context, so there's billions of people on this planet. 00:22:35.740 |
There are literally millions of people working on technology. 00:22:39.380 |
I feel lucky to be doing what I do and getting paid for it, and there's an interest in it. 00:22:47.660 |
But there's so many things going on in parallel. 00:22:56.500 |
The vectors of all these different things are happening all the time. 00:23:01.460 |
I'm sure some philosopher or meta-philosopher is wondering about how we transform our world. 00:23:11.060 |
- So you can't deny the fact that these tools are changing our world. 00:23:19.860 |
- So do you think it's changing for the better? 00:23:24.300 |
- I read this thing recently, it said the two disciplines with the highest GRE scores in college are physics and philosophy. 00:23:34.540 |
And they're both sort of trying to answer the question, why is there anything? 00:23:38.940 |
And the philosophers are on the theological side, and the physicists are obviously on the material side. 00:23:47.460 |
And there's 100 billion galaxies with 100 billion stars. 00:24:01.500 |
It's hard to say what it's all for, if that's what you're asking. 00:24:05.300 |
- Things do tend to significantly increase in complexity. 00:24:11.300 |
And I'm curious about how computation, like our world, our physical world, inherently generates complexity. 00:24:22.220 |
So we have XYZ coordinates; you take a sphere, you make it bigger, you get a surface that grows by r squared. 00:24:28.900 |
Like it generally generates mathematics and the mathematicians and the physicists have 00:24:33.540 |
been having a lot of fun talking to each other for years. 00:24:36.100 |
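A worked version of the sphere remark (editor's note): for a sphere of radius $r$,

$$A = 4\pi r^2, \qquad V = \tfrac{4}{3}\pi r^3,$$

so doubling the radius multiplies the volume by 8 but the surface by only 4; the geometry itself hands you the exponents.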
And computation has been, let's say, relatively pedestrian. 00:24:40.980 |
Like computation in terms of mathematics has been doing binary algebra, while those guys 00:24:47.320 |
have been gallivanting through the other realms of possibility, right? 00:24:52.900 |
Now recently, the computation lets you do mathematical computations that are sophisticated 00:25:00.460 |
enough that nobody understands how the answers came out, right? 00:25:06.980 |
It used to be, you get a data set, you guess at a function, and the function is considered 00:25:12.660 |
physics if it's predictive of new functions, new data sets. 00:25:18.180 |
Modern, you can take a large data set with no intuition about what it is and use machine 00:25:25.340 |
learning to find a pattern that has no function, right? 00:25:29.340 |
And it can arrive at results that I don't know if they're completely mathematically describable. 00:25:34.940 |
So computation has kind of done something interesting compared to A equals B plus C.