
Jim Keller: Abstraction Layers from the Atom to the Data Center | AI Podcast Clips



00:00:00.000 | - So let's get into the basics before we zoom back out.
00:00:05.000 | How do you build a computer from scratch?
00:00:08.820 | What is a microprocessor?
00:00:10.580 | What is a microarchitecture?
00:00:11.940 | What's an instruction set architecture?
00:00:14.460 | Maybe even as far back as what is a transistor?
00:00:17.280 | - So the special charm of computer engineering
00:00:22.840 | is there's a relatively good understanding
00:00:26.220 | of abstraction layers.
00:00:28.260 | So down at the bottom you have atoms,
00:00:30.080 | and atoms get put together in materials like silicon
00:00:33.280 | or doped silicon or metal, and we build transistors.
00:00:37.240 | On top of that, we build logic gates, right?
00:00:41.500 | And then functional units, like an adder, a subtractor,
00:00:45.240 | an instruction parsing unit, and then we assemble those
00:00:47.700 | into processing elements.
00:00:50.120 | Modern computers are built out of probably 10 to 20
00:00:55.040 | locally organized processing elements
00:00:58.760 | or coherent processing elements,
00:01:00.440 | and then that runs computer programs, right?
00:01:04.440 | So there's abstraction layers, and then software,
00:01:07.560 | there's an instruction set you run,
00:01:09.560 | and then there's assembly language, C, C++, Java, JavaScript.
00:01:14.240 | There's abstraction layers, essentially from the atom
00:01:17.760 | to the data center, right?
00:01:20.320 | So when you build a computer,
00:01:24.520 | first there's a target, like what's it for?
00:01:26.320 | Like how fast does it have to be?
00:01:27.740 | Which today there's a whole bunch of metrics
00:01:30.000 | about what that is.
00:01:31.620 | And then in an organization of 1,000 people
00:01:34.840 | who build a computer, there's lots of different disciplines
00:01:39.840 | that you have to operate on.
00:01:41.920 | Does that make sense?
00:01:43.280 | And so--
00:01:44.120 | - So there's a bunch of levels of abstraction.
00:01:47.380 | In an organization like Intel, and in your own vision,
00:01:53.560 | there's a lot of brilliance that comes in
00:01:55.400 | at every one of those layers.
00:01:57.500 | Some of it is science, some of it is engineering,
00:01:59.480 | some of it is art.
00:02:01.160 | What's the most, if you could pick favorites,
00:02:04.160 | what's the most important, your favorite layer
00:02:07.080 | on these layers of abstractions?
00:02:08.900 | Where does the magic enter this hierarchy?
00:02:11.740 | - I don't really care.
00:02:14.920 | That's the fun, you know, I'm somewhat agnostic to that.
00:02:18.560 | So I would say for relatively long periods of time,
00:02:23.320 | instruction sets are stable.
00:02:25.840 | So the x86 instruction set, the ARM instruction set.
00:02:29.800 | - What's an instruction set?
00:02:31.160 | - So it says, how do you encode the basic operations?
00:02:33.920 | Load, store, multiply, add, subtract, conditional branch.
00:02:37.420 | There aren't that many interesting instructions.
00:02:41.600 | Like if you look at a program and it runs,
00:02:44.240 | 90% of the execution is on 25 opcodes, 25 instructions.
00:02:49.240 | And those are stable, right?
00:02:51.720 | - What does it mean, stable?
00:02:53.280 | - Intel architecture has been around for 25 years.
00:02:55.960 | - It works.
00:02:56.800 | - It works.
00:02:57.620 | And that's because the basics, you know,
00:03:00.320 | are defined a long time ago, right?
00:03:03.100 | Now, the way an old computer ran is you fetched instructions
00:03:08.040 | and you executed them in order.
00:03:10.800 | Do the load, do the add, do the compare.
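As a purely illustrative sketch (nothing from the conversation itself), an "old computer" of the kind Keller describes just walks the instruction list in order, finishing each operation before starting the next. The instruction names, registers, and values below are made up:

    # Toy in-order machine: fetch one instruction, execute it, move on.
    program = [
        ("load", "r1", "a"),         # r1 <- mem[a]
        ("load", "r2", "b"),         # r2 <- mem[b]
        ("add",  "r3", "r1", "r2"),  # r3 <- r1 + r2
        ("cmp",  "r4", "r3", 0),     # r4 <- (r3 > 0)
    ]
    memory = {"a": 5, "b": -2}
    regs = {}

    for inst in program:             # strictly in program order: load, load, add, compare
        op = inst[0]
        if op == "load":
            regs[inst[1]] = memory[inst[2]]
        elif op == "add":
            regs[inst[1]] = regs[inst[2]] + regs[inst[3]]
        elif op == "cmp":
            regs[inst[1]] = regs[inst[2]] > inst[3]

    print(regs)  # {'r1': 5, 'r2': -2, 'r3': 3, 'r4': True}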
00:03:13.960 | The way a modern computer works is you fetch
00:03:17.600 | large numbers of instructions, say 500.
00:03:22.000 | And then you find the dependency graph
00:03:24.040 | between the instructions.
00:03:25.720 | And then you execute, in independent units,
00:03:30.100 | those little micro-graphs.
00:03:32.200 | So a modern computer, like people like to say,
00:03:35.560 | computers should be simple and clean.
00:03:38.520 | But it turns out the market for simple,
00:03:40.180 | clean, slow computers is zero, right?
00:03:44.040 | We don't sell any simple, clean computers.
00:03:47.360 | Now, the way you build it can be clean,
00:03:51.320 | but the computer people want to buy,
00:03:54.480 | that's say in a phone or a data center,
00:03:58.200 | fetches a large number of instructions,
00:04:00.440 | computes the dependency graph,
00:04:03.360 | and then executes it in a way that gets the right answers.
00:04:06.920 | - And optimize that graph somehow.
00:04:08.640 | - Yeah, they run deeply out of order.
00:04:11.280 | And then there's semantics around how memory ordering works
00:04:15.320 | and other things work.
00:04:16.160 | So the computer sort of has a bunch of bookkeeping tables
00:04:19.720 | that says what order do these operations finish in
00:04:23.280 | or appear to finish in.
00:04:25.560 | But to go fast, you have to fetch a lot of instructions
00:04:28.480 | and find all the parallelism.
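A rough sketch of that "fetch a window, find the dependency graph" idea, in Python rather than hardware (the five-instruction window and register names are invented; real machines do this with register renaming and scheduling logic, not software):

    # Build read-after-write dependencies for a small window, then group
    # instructions into levels whose members could issue in parallel.
    window = [
        ("i0", "r1", []),            # r1 <- load a
        ("i1", "r2", []),            # r2 <- load b
        ("i2", "r3", ["r1", "r2"]),  # r3 <- r1 + r2
        ("i3", "r4", ["r2"]),        # r4 <- r2 * 2
        ("i4", "r5", ["r3", "r4"]),  # r5 <- r3 - r4
    ]

    last_writer, depends_on = {}, {}
    for name, dest, srcs in window:
        depends_on[name] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dest] = name

    issued, levels = set(), []
    while len(issued) < len(window):
        ready = [n for n, _, _ in window
                 if n not in issued and depends_on[n] <= issued]
        levels.append(ready)          # everything in `ready` is independent work
        issued |= set(ready)

    print(levels)  # [['i0', 'i1'], ['i2', 'i3'], ['i4']]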
00:04:30.480 | Now there's a second kind of computer,
00:04:33.240 | which we call GPUs today.
00:04:35.340 | And here's how I describe the difference.
00:04:37.400 | There's found parallelism,
00:04:38.760 | like you have a program with a lot of dependent instructions.
00:04:41.880 | You fetch a bunch and then you go figure out
00:04:43.880 | the dependency graph and you issue instructions out of order.
00:04:47.200 | That's because you have one serial narrative to execute,
00:04:50.760 | which in fact can be done out of order.
00:04:53.640 | - You call it a narrative?
00:04:54.880 | - Yeah.
00:04:55.720 | - Wow.
00:04:56.560 | - So yeah, so humans think in serial narrative.
00:04:58.480 | So read a book, right?
00:05:00.760 | There's a sentence after sentence after sentence
00:05:03.560 | and there's paragraphs.
00:05:04.640 | Now you could diagram that.
00:05:07.160 | Imagine you diagrammed it properly and you said,
00:05:09.640 | which sentences could be read in any order
00:05:14.040 | without changing the meaning, right?
00:05:17.780 | - That's a fascinating question to ask of a book.
00:05:19.980 | Yeah.
00:05:20.820 | - Yeah, you could do that, right?
00:05:22.180 | So some paragraphs could be reordered,
00:05:24.100 | some sentences can be reordered.
00:05:26.220 | You could say, he is tall and smart and X, right?
00:05:31.220 | And it doesn't matter the order of tall and smart.
00:05:36.020 | But if you say the tall man is wearing a red shirt,
00:05:40.720 | what color is it? You know, like you can create dependencies.
00:05:44.860 | Right?
00:05:46.020 | Right, and so GPUs on the other hand
00:05:49.820 | run simple programs on pixels,
00:05:53.100 | but you're given a million of them.
00:05:54.660 | And to first order, the screen you're looking at
00:05:57.940 | doesn't care which order you do it in.
00:05:59.980 | So I call that given parallelism.
00:06:02.260 | Simple narratives around the large numbers of things
00:06:06.080 | where you can just say it's parallel
00:06:08.220 | because you told me it was.
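A small sketch of that "given" parallelism, under the assumption that the per-pixel function here stands in for a real shader: the same tiny program runs on every pixel, no pixel depends on any other, so execution order is irrelevant:

    # Each pixel is computed independently; the map could run in any order,
    # on any number of lanes, and the image comes out the same.
    from multiprocessing import Pool

    def shade(pixel):
        x, y = pixel
        return (x ^ y) & 0xFF        # stand-in for a per-pixel shader program

    if __name__ == "__main__":
        pixels = [(x, y) for y in range(256) for x in range(256)]
        with Pool() as pool:
            image = pool.map(shade, pixels)
        print(len(image))            # 65536 pixels, order of evaluation irrelevant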
00:06:10.100 | - So found parallelism where the narrative is sequential
00:06:15.480 | but you discover like little pockets of parallelism
00:06:18.900 | versus--
00:06:19.740 | - Turns out large pockets of parallelism.
00:06:21.780 | - Large, so how hard is it to discover?
00:06:23.660 | - Well, how hard is it?
00:06:24.740 | That's just transistor count, right?
00:06:26.620 | So once you crack the problem, you say,
00:06:28.940 | here's how you fetch 10 instructions at a time,
00:06:31.260 | here's how you calculate the dependencies between them,
00:06:34.140 | here's how you describe the dependencies,
00:06:36.300 | here's, you know, these are pieces, right?
00:06:38.460 | - So once you describe the dependencies,
00:06:43.360 | then it's just a graph.
00:06:45.380 | It's sort of an algorithm that finds,
00:06:47.980 | what is that?
00:06:49.740 | I'm sure there's a graph-theoretical
00:06:51.380 | answer here that's solvable.
00:06:53.660 | In general, programs, modern programs
00:06:58.500 | that human beings write,
00:07:00.020 | how much found parallelism is there in them?
00:07:02.980 | - About 10x.
00:07:03.820 | - What does 10x mean?
00:07:04.660 | - So if you execute it in order.
00:07:07.500 | - Versus, yeah.
00:07:09.300 | - You would get what's called cycles per instruction
00:07:11.700 | and it would be about, you know, three instructions,
00:07:15.980 | three cycles per instruction
00:07:17.780 | because of the latency of the operations and stuff.
00:07:20.540 | And a modern computer
00:07:22.260 | executes at like 0.25 cycles per instruction.
00:07:26.480 | So it's about, we today find 10x.
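Back-of-the-envelope arithmetic behind that figure (just the numbers from the conversation, not measured data):

    in_order_cpi = 3.0    # ~3 cycles per instruction executing strictly in order
    ooo_cpi = 0.25        # ~0.25 cycles per instruction out of order
    print(in_order_cpi / ooo_cpi)   # 12.0 -- roughly the "about 10x" found parallelism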
00:07:29.580 | And there's two things.
00:07:30.780 | One is the found parallelism in the narrative, right?
00:07:35.140 | And the other is the predictability of the narrative, right?
00:07:39.140 | So certain operations, they do a bunch of calculations
00:07:43.280 | and if greater than one, do this, else do that.
00:07:46.640 | That decision is predicted in modern computers
00:07:50.940 | to high-90s percent accuracy.
00:07:53.980 | So branches happen a lot.
00:07:56.500 | So imagine you have a decision
00:07:58.180 | to make every six instructions,
00:07:59.540 | which is about the average, right?
00:08:01.540 | But you want to fetch 500 instructions,
00:08:03.240 | figure out the graph and execute them all in parallel.
00:08:06.220 | That means you have, let's say,
00:08:09.380 | if you fetch 600 instructions and it's every six,
00:08:12.780 | you have to predict
00:08:13.860 | 99 out of a hundred branches correctly
00:08:17.180 | for that window to be effective.
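A quick sketch of why the accuracy has to be that high, using the conversation's numbers (a branch roughly every six instructions, a window of a few hundred instructions):

    # Every branch in the window must be predicted correctly for the whole
    # window of fetched work to be useful.
    window = 500
    branch_every = 6
    branches = window // branch_every              # ~83 predictions in flight

    for accuracy in (0.85, 0.92, 0.99):
        print(accuracy, round(accuracy ** branches, 3))
    # 0.85 -> ~0.000   0.92 -> ~0.001   0.99 -> ~0.434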
00:08:20.140 | - Okay, so parallelism, you can't parallelize branches
00:08:24.660 | or you can.
00:08:25.500 | - No, you can predict--
00:08:26.900 | - What does predict a branch mean?
00:08:28.420 | What does predict--
00:08:29.260 | - So imagine you do a computation over and over,
00:08:31.380 | you're in a loop.
00:08:32.740 | So while n is greater than one, do.
00:08:37.220 | And you go through that loop a million times.
00:08:39.020 | So every time you look at the branch,
00:08:40.440 | you say, it's probably still greater than one.
00:08:43.540 | - And you're saying you could do that accurately.
00:08:45.640 | - Very accurately.
00:08:46.480 | Modern computers-- - My mind is blown.
00:08:48.020 | How the heck do you do that?
00:08:49.260 | Wait a minute.
00:08:50.420 | - Well, you want to know?
00:08:51.620 | This is really sad.
00:08:53.300 | 20 years ago, you simply recorded
00:08:56.500 | which way the branch went last time
00:08:58.420 | and predicted the same thing.
00:09:00.580 | - Right. - Okay.
00:09:02.140 | - What's the accuracy of that?
00:09:03.940 | - 85%.
00:09:05.900 | So then somebody said, hey, let's keep a couple of bits
00:09:09.580 | and have a little counter.
00:09:10.860 | So when it predicts one way, we count up, and then it pins.
00:09:14.500 | So say you have a three-bit counter,
00:09:15.840 | so you count up and then you count down.
00:09:18.540 | And you can use the top bit as the sign bit,
00:09:21.060 | so you have a signed two-bit number.
00:09:22.820 | So if it's greater than one, you predict taken
00:09:25.260 | and less than one, you predict not taken, right?
00:09:29.260 | Or less than zero, whatever the thing is.
00:09:31.880 | And that got us to 92%.
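A minimal sketch of that two-bit saturating counter (the classic textbook scheme, not any specific product's predictor):

    class TwoBitPredictor:
        def __init__(self):
            self.counter = 1                 # states 0..3, start weakly not-taken

        def predict(self):
            return self.counter >= 2         # top bit set -> predict taken

        def update(self, taken):
            self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False] + [True] * 9   # a loop branch that exits once
    hits = 0
    for taken in outcomes:
        hits += (p.predict() == taken)
        p.update(taken)
    print(hits, "of", len(outcomes))  # 17 of 19: one cold-start miss plus one miss at the exit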
00:09:33.940 | - Oh.
00:09:35.100 | - Okay, now it gets better.
00:09:37.340 | This branch depends on how you got there.
00:09:40.680 | So if you came down the code one way,
00:09:43.320 | you're talking about Bob and Jane, right?
00:09:46.180 | And then said, does Bob like Jane?
00:09:48.240 | It went one way, but if you're talking about Bob and Jill,
00:09:50.700 | does Bob like Jane?
00:09:51.740 | You go a different way, right?
00:09:53.580 | So that's called history.
00:09:54.700 | So you take the history and a counter.
00:09:56.700 | That's cool, but that's not how anything works today.
00:10:01.160 | They use something that looks a little like
00:10:03.060 | a neural network.
00:10:04.060 | So in modern predictors, you take all the execution flows
00:10:09.020 | and then you do basically deep pattern recognition
00:10:13.900 | of how the program is executing.
00:10:16.300 | And you do that multiple different ways
00:10:21.500 | and you have something that chooses what the best result is.
00:10:24.500 | There's a little supercomputer inside the computer.
00:10:28.220 | - That's trying to predict branching.
00:10:29.060 | - That calculates which way branches go.
00:10:32.120 | So the effective window that it's worth finding
00:10:35.080 | graphs in gets bigger.
00:10:36.100 | - Why was that gonna make me sad?
00:10:39.640 | 'Cause that's amazing.
00:10:40.680 | - It's amazingly complicated.
00:10:42.200 | - Oh, well.
00:10:43.040 | - Well, here's the funny thing.
00:10:44.840 | So to get to 85% took a thousand bits.
00:10:49.500 | To get to 99% takes tens of megabits.
00:10:56.640 | So this is one of those.
00:10:58.900 | To get the result, to get from a window of say
00:11:02.860 | 50 instructions to 500, it took three orders of magnitude
00:11:07.300 | or four orders of magnitude more bits.
00:11:09.380 | - Now if you get the prediction of a branch wrong,
00:11:13.280 | what happens then?
00:11:14.120 | - You flush the pipe.
00:11:15.180 | - You flush the pipe, so it's just the performance cost.
00:11:17.380 | - But it gets even better.
00:11:19.260 | So we're starting to look at stuff that says,
00:11:21.660 | say you executed down this path
00:11:24.500 | and then you had two ways to go,
00:11:27.060 | but far, far away there's something that doesn't depend
00:11:30.300 | on which path you went.
00:11:31.460 | So you took the wrong path, you executed a bunch of stuff.
00:11:37.080 | Then you hit the mispredict, you backed it up,
00:11:40.220 | but you remembered all the results you already calculated.
00:11:43.300 | Some of those are just fine.
00:11:45.460 | Like if you read a book and you misunderstand a paragraph,
00:11:48.060 | your understanding of the next paragraph
00:11:50.300 | is sometimes invariant to that misunderstanding.
00:11:53.540 | Sometimes it depends on it.
00:11:55.380 | - And you can kind of anticipate that invariance.
00:12:01.060 | - Yeah, well, you can keep track of whether
00:12:03.740 | the data changed and so when you come back
00:12:06.100 | to a piece of code, should you calculate it again
00:12:08.180 | or do the same thing?
00:12:09.660 | - Okay, how much of this is art
00:12:11.140 | and how much of it is science?
00:12:13.420 | 'Cause it sounds pretty complicated.
00:12:16.900 | - Well, how do you describe a situation?
00:12:18.420 | So imagine you come to a point in the road
00:12:20.380 | where you have to make a decision.
00:12:22.940 | And you have a bunch of knowledge about which way to go.
00:12:24.860 | Maybe you have a map.
00:12:26.700 | So you wanna go the shortest way
00:12:29.340 | or do you wanna go the fastest way
00:12:30.940 | or you wanna take the nicest road.
00:12:32.580 | So there's some set of data.
00:12:35.620 | So imagine you're doing something complicated
00:12:37.420 | like building a computer.
00:12:39.580 | And there's hundreds of decision points
00:12:42.100 | all with hundreds of possible ways to go.
00:12:45.500 | And the ways you pick interact in a complicated way.
00:12:51.220 | And then you have to pick the right spot.
00:12:53.460 | - Right, so that's-- - Or is that art or science?
00:12:54.700 | I don't know.
00:12:55.540 | - You avoided the question.
00:12:56.700 | You just described the Robert Frost problem
00:12:59.140 | of road less taken.
00:13:00.380 | - Described the Robert Frost problem?
00:13:03.500 | (laughing)
00:13:05.380 | That's what we do as computer designers.
00:13:07.220 | It's all poetry.
00:13:08.180 | - Okay. - Great.
00:13:09.220 | Yeah, I don't know how to describe that
00:13:11.980 | because some people are very good
00:13:14.180 | at making those intuitive leaps.
00:13:15.700 | It seems like this combination of things.
00:13:18.340 | Some people are less good at it
00:13:19.980 | but they're really good at evaluating the alternatives.
00:13:23.380 | Right, and everybody has a different way to do it.
00:13:27.060 | And some people can't make those leaps
00:13:29.660 | but they're really good at analyzing it.
00:13:32.100 | So when you see computers are designed by teams of people
00:13:34.660 | who have very different skill sets
00:13:37.060 | and a good team has lots of different kinds of people.
00:13:42.060 | I suspect you would describe some of them as artistic.
00:13:44.980 | - Right. - But not very many.
00:13:48.220 | - Unfortunately or fortunately.
00:13:49.860 | - Unfortunately.
00:13:51.500 | Well, you know, computer design's hard.
00:13:54.260 | It's 99% perspiration.
00:13:57.260 | The 1% inspiration is really important.
00:14:01.100 | - But you still need the 99.
00:14:03.660 | - Yeah, you gotta do a lot of work.
00:14:05.140 | And then there are interesting things to do
00:14:08.580 | at every level of that stack.
00:14:10.580 | - At the end of the day,
00:14:13.500 | if you run the same program multiple times,
00:14:16.680 | does it always produce the same result?
00:14:19.260 | Is there some room for fuzziness there?
00:14:22.540 | - That's a math problem.
00:14:24.540 | So if you run a correct C program,
00:14:26.380 | the definition is every time you run it,
00:14:29.280 | you get the same answer.
00:14:30.260 | - Yeah, well, that's a math statement.
00:14:32.300 | - Well, that's a language definitional statement.
00:14:35.260 | So for years when we first did 3D acceleration of graphics,
00:14:40.260 | you could run the same scene multiple times
00:14:45.100 | and get different answers.
00:14:46.580 | - Right. - Right.
00:14:48.020 | And then some people thought that was okay
00:14:50.140 | and some people thought it was a bad idea.
00:14:52.340 | And then when the HPC world used GPUs for calculations,
00:14:57.060 | they thought it was a really bad idea, okay?
00:14:59.940 | Now, in modern AI stuff,
00:15:02.220 | people are looking at networks
00:15:05.940 | where the precision of the data is low enough
00:15:08.860 | that the data is somewhat noisy.
00:15:11.460 | And the observation is the input data is unbelievably noisy.
00:15:15.080 | So why should the calculation be not noisy?
00:15:18.020 | And people have experimented with algorithms
00:15:20.000 | that say you can get faster answers by being noisy.
00:15:23.740 | Like as a network starts to converge,
00:15:26.060 | if you look at the computation graph,
00:15:27.380 | it starts out really wide and then it gets narrower.
00:15:29.940 | And you can say, is that last little bit that important?
00:15:32.260 | Or should I start the graph on the next rev
00:15:35.500 | before we whittle it all the way down to the answer?
00:15:38.560 | Right, so you can create algorithms that are noisy.
00:15:41.840 | Now, if you're developing something
00:15:43.240 | and every time you run it, you get a different answer,
00:15:45.240 | it's really annoying.
00:15:47.120 | And so most people think even today,
00:15:51.740 | every time you run the program, you get the same answer.
00:15:54.540 | - Now I know, but the question is,
00:15:56.160 | that's the formal definition of a programming language.
00:16:00.200 | There are definitions of languages
00:16:02.320 | that don't guarantee the same answer, but people who use those--
00:16:05.180 | You always want something 'cause you get a bad answer
00:16:08.560 | and then you're wondering, is it because
00:16:11.040 | of something in the algorithm or because of this?
00:16:13.160 | And so everybody wants a little switch that says,
00:16:14.920 | no matter what, do it deterministically.
00:16:18.080 | And it's really weird 'cause almost everything
00:16:20.200 | going into modern calculations is noisy.
00:16:23.120 | So why do the answers have to be so clear?
00:16:26.040 | - Right, so where do you stand?
00:16:27.400 | - I design computers for people who run programs.
00:16:30.280 | So if somebody says, I want a deterministic answer,
00:16:34.680 | like most people want that.
00:16:36.200 | - Can you deliver a deterministic answer,
00:16:38.000 | I guess is the question.
00:16:39.240 | Like when you--
00:16:40.080 | - Yeah, hopefully, sure.
00:16:41.880 | What people don't realize is you get a deterministic answer
00:16:45.080 | even though the execution flow is very nondeterministic.
00:16:48.880 | So you run this program 100 times,
00:16:50.880 | it never runs the same way twice, ever.
00:16:53.880 | - And the answer, it arrives at the same answer.
00:16:55.760 | - But it gets the same answer every time.
00:16:57.000 | - It's just amazing.