
Jim Keller: Abstraction Layers from the Atom to the Data Center | AI Podcast Clips



00:00:00.000 | - So let's get into the basics before we zoom back out.
00:00:05.000 | How do you build a computer from scratch?
00:00:08.820 | What is a microprocessor?
00:00:10.580 | What is a microarchitecture?
00:00:11.940 | What's an instruction set architecture?
00:00:14.460 | Maybe even as far back as what is a transistor?
00:00:17.280 | - So the special charm of computer engineering
00:00:22.840 | is there's a relatively good understanding
00:00:26.220 | of abstraction layers.
00:00:28.260 | So down at the bottom you have atoms,
00:00:30.080 | and atoms get put together in materials like silicon
00:00:33.280 | or doped silicon or metal, and we build transistors.
00:00:37.240 | On top of that, we build logic gates, right?
00:00:41.500 | And then functional units, like an adder, a subtractor,
00:00:45.240 | an instruction parsing unit, and then we assemble those
00:00:47.700 | into processing elements.
00:00:50.120 | Modern computers are built out of probably 10 to 20
00:00:55.040 | locally organized processing elements
00:00:58.760 | or coherent processing elements,
00:01:00.440 | and then that runs computer programs, right?
00:01:04.440 | So there's abstraction layers, and then software,
00:01:07.560 | there's an instruction set you run,
00:01:09.560 | and then there's assembly language, C, C++, Java, JavaScript.
00:01:14.240 | There's abstraction layers, essentially from the atom
00:01:17.760 | to the data center, right?
00:01:20.320 | So when you build a computer,
00:01:24.520 | first there's a target, like what's it for?
00:01:26.320 | Like how fast does it have to be?
00:01:27.740 | Which today there's a whole bunch of metrics
00:01:30.000 | about what that is.
00:01:31.620 | And then in an organization of 1,000 people
00:01:34.840 | who build a computer, there's lots of different disciplines
00:01:39.840 | that you have to operate on.
00:01:41.920 | Does that make sense?
00:01:43.280 | And so--
00:01:44.120 | - So there's a bunch of levels of abstraction.
00:01:47.380 | In an organization like Intel, and in your own vision,
00:01:53.560 | there's a lot of brilliance that comes in
00:01:55.400 | at every one of those layers.
00:01:57.500 | Some of it is science, some of it is engineering,
00:01:59.480 | some of it is art.
00:02:01.160 | What's the most, if you could pick favorites,
00:02:04.160 | what's the most important, your favorite layer
00:02:07.080 | on these layers of abstractions?
00:02:08.900 | Where does the magic enter this hierarchy?
00:02:11.740 | - I don't really care.
00:02:14.920 | That's the fun, you know, I'm somewhat agnostic to that.
00:02:18.560 | So I would say for relatively long periods of time,
00:02:23.320 | instruction sets are stable.
00:02:25.840 | So the x86 instruction set, the ARM instruction set.
00:02:29.800 | - What's an instruction set?
00:02:31.160 | - So it says, how do you encode the basic operations?
00:02:33.920 | Load, store, multiply, add, subtract, conditional branch.
00:02:37.420 | There aren't that many interesting instructions.
00:02:41.600 | Like if you look at a program and it runs,
00:02:44.240 | 90% of the execution is on 25 opcodes, 25 instructions.
00:02:49.240 | And those are stable, right?
00:02:51.720 | - What does it mean, stable?
00:02:53.280 | - Intel architecture has been around for 25 years.
00:02:55.960 | - It works.
00:02:56.800 | - It works.
00:02:57.620 | And that's because the basics, you know,
00:03:00.320 | are defined a long time ago, right?
00:03:03.100 | Now, the way an old computer ran is you fetched instructions
00:03:08.040 | and you executed them in order.
00:03:10.800 | Do the load, do the add, do the compare.
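As a purely illustrative sketch (nothing from the conversation itself), an "old computer" of the kind Keller describes just walks the instruction list in order, finishing each operation before starting the next. The instruction names, registers, and values below are made up:

    # Toy in-order machine: fetch one instruction, execute it, move on.
    program = [
        ("load", "r1", "a"),         # r1 <- mem[a]
        ("load", "r2", "b"),         # r2 <- mem[b]
        ("add",  "r3", "r1", "r2"),  # r3 <- r1 + r2
        ("cmp",  "r4", "r3", 0),     # r4 <- (r3 > 0)
    ]
    memory = {"a": 5, "b": -2}
    regs = {}

    for inst in program:             # strictly in program order: load, load, add, compare
        op = inst[0]
        if op == "load":
            regs[inst[1]] = memory[inst[2]]
        elif op == "add":
            regs[inst[1]] = regs[inst[2]] + regs[inst[3]]
        elif op == "cmp":
            regs[inst[1]] = regs[inst[2]] > inst[3]

    print(regs)  # {'r1': 5, 'r2': -2, 'r3': 3, 'r4': True}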
00:03:13.960 | The way a modern computer works is you fetch
00:03:17.600 | large numbers of instructions, say 500.
00:03:22.000 | And then you find the dependency graph
00:03:24.040 | between the instructions.
00:03:25.720 | And then you execute, in independent units,
00:03:30.100 | those little micro-graphs.
00:03:32.200 | So a modern computer, like people like to say,
00:03:35.560 | computers should be simple and clean.
00:03:38.520 | But it turns out the market for simple,
00:03:40.180 | clean, slow computers is zero, right?
00:03:44.040 | We don't sell any simple, clean computers.
00:03:47.360 | Now, the way you build it can be clean,
00:03:51.320 | but the computer people want to buy,
00:03:54.480 | that's say in a phone or a data center,
00:03:58.200 | fetches a large number of instructions,
00:04:00.440 | computes the dependency graph,
00:04:03.360 | and then executes it in a way that gets the right answers.
00:04:06.920 | - And optimize that graph somehow.
00:04:08.640 | - Yeah, they run deeply out of order.
00:04:11.280 | And then there's semantics around how memory ordering works
00:04:15.320 | and other things work.
00:04:16.160 | So the computer sort of has a bunch of bookkeeping tables
00:04:19.720 | that says what order do these operations finish in
00:04:23.280 | or appear to finish in.
00:04:25.560 | But to go fast, you have to fetch a lot of instructions
00:04:28.480 | and find all the parallelism.
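A rough sketch of that "fetch a window, find the dependency graph" idea, in Python rather than hardware (the five-instruction window and register names are invented; real machines do this with register renaming and scheduling logic, not software):

    # Build read-after-write dependencies for a small window, then group
    # instructions into levels whose members could issue in parallel.
    window = [
        ("i0", "r1", []),            # r1 <- load a
        ("i1", "r2", []),            # r2 <- load b
        ("i2", "r3", ["r1", "r2"]),  # r3 <- r1 + r2
        ("i3", "r4", ["r2"]),        # r4 <- r2 * 2
        ("i4", "r5", ["r3", "r4"]),  # r5 <- r3 - r4
    ]

    last_writer, depends_on = {}, {}
    for name, dest, srcs in window:
        depends_on[name] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dest] = name

    issued, levels = set(), []
    while len(issued) < len(window):
        ready = [n for n, _, _ in window
                 if n not in issued and depends_on[n] <= issued]
        levels.append(ready)          # everything in `ready` is independent work
        issued |= set(ready)

    print(levels)  # [['i0', 'i1'], ['i2', 'i3'], ['i4']]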
00:04:30.480 | Now there's a second kind of computer,
00:04:33.240 | which we call GPUs today.
00:04:35.340 | And here's how I describe the difference.
00:04:37.400 | There's found parallelism,
00:04:38.760 | like you have a program with a lot of dependent instructions.
00:04:41.880 | You fetch a bunch and then you go figure out
00:04:43.880 | the dependency graph and you issue instructions out of order.
00:04:47.200 | That's because you have one serial narrative to execute,
00:04:50.760 | which in fact can be done out of order.
00:04:53.640 | - You call it a narrative?
00:04:54.880 | - Yeah.
00:04:55.720 | - Wow.
00:04:56.560 | - So yeah, so humans think in serial narrative.
00:04:58.480 | So read a book, right?
00:05:00.760 | There's a sentence after sentence after sentence
00:05:03.560 | and there's paragraphs.
00:05:04.640 | Now you could diagram that.
00:05:07.160 | Imagine you diagrammed it properly and you said,
00:05:09.640 | which sentences could be read in any order
00:05:14.040 | without changing the meaning, right?
00:05:17.780 | - That's a fascinating question to ask of a book.
00:05:19.980 | Yeah.
00:05:20.820 | - Yeah, you could do that, right?
00:05:22.180 | So some paragraphs could be reordered,
00:05:24.100 | some sentences can be reordered.
00:05:26.220 | You could say, he is tall and smart and X, right?
00:05:31.220 | And it doesn't matter the order of tall and smart.
00:05:36.020 | But if you say the tall man is wearing a red shirt,
00:05:40.720 | what color is it? You know, like you can create dependencies.
00:05:44.860 | Right?
00:05:46.020 | Right, and so GPUs on the other hand
00:05:49.820 | run simple programs on pixels,
00:05:53.100 | but you're given a million of them.
00:05:54.660 | And to first order, the screen you're looking at
00:05:57.940 | doesn't care which order you do it in.
00:05:59.980 | So I call that given parallelism.
00:06:02.260 | Simple narratives around the large numbers of things
00:06:06.080 | where you can just say it's parallel
00:06:08.220 | because you told me it was.
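A small sketch of that "given" parallelism, under the assumption that the per-pixel function here stands in for a real shader: the same tiny program runs on every pixel, no pixel depends on any other, so execution order is irrelevant:

    # Each pixel is computed independently; the map could run in any order,
    # on any number of lanes, and the image comes out the same.
    from multiprocessing import Pool

    def shade(pixel):
        x, y = pixel
        return (x ^ y) & 0xFF        # stand-in for a per-pixel shader program

    if __name__ == "__main__":
        pixels = [(x, y) for y in range(256) for x in range(256)]
        with Pool() as pool:
            image = pool.map(shade, pixels)
        print(len(image))            # 65536 pixels, order of evaluation irrelevant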
00:06:10.100 | - So found parallelism where the narrative is sequential
00:06:15.480 | but you discover like little pockets of parallelism
00:06:18.900 | versus--
00:06:19.740 | - Turns out large pockets of parallelism.
00:06:21.780 | - Large, so how hard is it to discover?
00:06:23.660 | - Well, how hard is it?
00:06:24.740 | That's just transistor count, right?
00:06:26.620 | So once you crack the problem, you say,
00:06:28.940 | here's how you fetch 10 instructions at a time,
00:06:31.260 | here's how you calculate the dependencies between them,
00:06:34.140 | here's how you describe the dependencies,
00:06:36.300 | here's, you know, these are pieces, right?
00:06:38.460 | - So once you describe the dependencies,
00:06:43.360 | then it's just a graph.
00:06:45.380 | It's sort of an algorithm that finds,
00:06:47.980 | what is that?
00:06:49.740 | I'm sure there's a graph-theoretical
00:06:51.380 | answer here that's solvable.
00:06:53.660 | In general, programs, modern programs
00:06:58.500 | that human beings write,
00:07:00.020 | how much found parallelism is there in them?
00:07:02.980 | - About 10x.
00:07:03.820 | - What does 10x mean?
00:07:04.660 | - So if you execute it in order.
00:07:07.500 | - Versus, yeah.
00:07:09.300 | - You would get what's called cycles per instruction
00:07:11.700 | and it would be about, you know, three instructions,
00:07:15.980 | three cycles per instruction
00:07:17.780 | because of the latency of the operations and stuff.
00:07:20.540 | And a modern computer
00:07:22.260 | executes at like 0.25 cycles per instruction.
00:07:26.480 | So it's about, we today find 10x.
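Back-of-the-envelope arithmetic behind that figure (just the numbers from the conversation, not measured data):

    in_order_cpi = 3.0    # ~3 cycles per instruction executing strictly in order
    ooo_cpi = 0.25        # ~0.25 cycles per instruction out of order
    print(in_order_cpi / ooo_cpi)   # 12.0 -- roughly the "about 10x" found parallelism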
00:07:29.580 | And there's two things.
00:07:30.780 | One is the found parallelism in the narrative, right?
00:07:35.140 | And the other is the predictability of the narrative, right?
00:07:39.140 | So certain operations, they do a bunch of calculations
00:07:43.280 | and if greater than one, do this, else do that.
00:07:46.640 | That decision is predicted in modern computers
00:07:50.940 | to high-90s percent accuracy.
00:07:53.980 | So branches happen a lot.
00:07:56.500 | So imagine you have a decision
00:07:58.180 | to make every six instructions,
00:07:59.540 | which is about the average, right?
00:08:01.540 | But you want to fetch 500 instructions,
00:08:03.240 | figure out the graph and execute them all in parallel.
00:08:06.220 | That means you have, let's say,
00:08:09.380 | if you fetch 600 instructions and it's every six,
00:08:12.780 | you have to predict
00:08:13.860 | 99 out of a hundred branches correctly
00:08:17.180 | for that window to be effective.
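A quick sketch of why the accuracy has to be that high, using the conversation's numbers (a branch roughly every six instructions, a window of a few hundred instructions):

    # Every branch in the window must be predicted correctly for the whole
    # window of fetched work to be useful.
    window = 500
    branch_every = 6
    branches = window // branch_every              # ~83 predictions in flight

    for accuracy in (0.85, 0.92, 0.99):
        print(accuracy, round(accuracy ** branches, 3))
    # 0.85 -> ~0.000   0.92 -> ~0.001   0.99 -> ~0.434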
00:08:20.140 | - Okay, so parallelism, you can't parallelize branches
00:08:24.660 | or you can.
00:08:25.500 | - No, you can predict--
00:08:26.900 | - What does predict a branch mean?
00:08:28.420 | What does predict--
00:08:29.260 | - So imagine you do a computation over and over,
00:08:31.380 | you're in a loop.
00:08:32.740 | So while n is greater than one, do.
00:08:37.220 | And you go through that loop a million times.
00:08:39.020 | So every time you look at the branch,
00:08:40.440 | you say, it's probably still greater than one.
00:08:43.540 | - And you're saying you could do that accurately.
00:08:45.640 | - Very accurately.
00:08:46.480 | Modern computers-- - My mind is blown.
00:08:48.020 | How the heck do you do that?
00:08:49.260 | Wait a minute.
00:08:50.420 | - Well, you want to know?
00:08:51.620 | This is really sad.
00:08:53.300 | 20 years ago, you simply recorded
00:08:56.500 | which way the branch went last time
00:08:58.420 | and predicted the same thing.
00:09:00.580 | - Right. - Okay.
00:09:02.140 | - What's the accuracy of that?
00:09:03.940 | - 85%.
00:09:05.900 | So then somebody said, hey, let's keep a couple of bits
00:09:09.580 | and have a little counter.
00:09:10.860 | So when it predicts one way, we count up, and then it pins.
00:09:14.500 | So say you have a three-bit counter,
00:09:15.840 | so you count up and then you count down.
00:09:18.540 | And you can use the top bit as the sign bit,
00:09:21.060 | so you have a signed two-bit number.
00:09:22.820 | So if it's greater than one, you predict taken
00:09:25.260 | and less than one, you predict not taken, right?
00:09:29.260 | Or less than zero, whatever the thing is.
00:09:31.880 | And that got us to 92%.
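A minimal sketch of that two-bit saturating counter (the classic textbook scheme, not any specific product's predictor):

    class TwoBitPredictor:
        def __init__(self):
            self.counter = 1                 # states 0..3, start weakly not-taken

        def predict(self):
            return self.counter >= 2         # top bit set -> predict taken

        def update(self, taken):
            self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False] + [True] * 9   # a loop branch that exits once
    hits = 0
    for taken in outcomes:
        hits += (p.predict() == taken)
        p.update(taken)
    print(hits, "of", len(outcomes))  # 17 of 19: one cold-start miss plus one miss at the exit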
00:09:33.940 | - Oh.
00:09:35.100 | - Okay, now it gets better.
00:09:37.340 | This branch depends on how you got there.
00:09:40.680 | So if you came down the code one way,
00:09:43.320 | you're talking about Bob and Jane, right?
00:09:46.180 | And then said, does Bob like Jane?
00:09:48.240 | It went one way, but if you're talking about Bob and Jill,
00:09:50.700 | does Bob like Jane?
00:09:51.740 | You go a different way, right?
00:09:53.580 | So that's called history.
00:09:54.700 | So you take the history and a counter.
00:09:56.700 | That's cool, but that's not how anything works today.
00:10:01.160 | They use something that looks a little like
00:10:03.060 | a neural network.
00:10:04.060 | So in modern predictors, you take all the execution flows
00:10:09.020 | and then you do basically deep pattern recognition
00:10:13.900 | of how the program is executing.
00:10:16.300 | And you do that multiple different ways
00:10:21.500 | and you have something that chooses what the best result is.
00:10:24.500 | There's a little supercomputer inside the computer.
00:10:28.220 | - That's trying to predict branching.
00:10:29.060 | - That calculates which way branches go.
00:10:32.120 | So the effective window that it's worth finding
00:10:35.080 | graphs in gets bigger.
00:10:36.100 | - Why was that gonna make me sad?
00:10:39.640 | 'Cause that's amazing.
00:10:40.680 | - It's amazingly complicated.
00:10:42.200 | - Oh, well.
00:10:43.040 | - Well, here's the funny thing.
00:10:44.840 | So to get to 85% took a thousand bits.
00:10:49.500 | To get to 99% takes tens of megabits.
00:10:56.640 | So this is one of those.
00:10:58.900 | To get the result, to get from a window of say
00:11:02.860 | 50 instructions to 500, it took three orders of magnitude
00:11:07.300 | or four orders of magnitude more bits.
00:11:09.380 | - Now if you get the prediction of a branch wrong,
00:11:13.280 | what happens then?
00:11:14.120 | - You flush the pipe.
00:11:15.180 | - You flush the pipe, so it's just the performance cost.
00:11:17.380 | - But it gets even better.
00:11:19.260 | So we're starting to look at stuff that says,
00:11:21.660 | say you executed down this path
00:11:24.500 | and then you had two ways to go,
00:11:27.060 | but far, far away there's something that doesn't depend
00:11:30.300 | on which path you went.
00:11:31.460 | So you took the wrong path, you executed a bunch of stuff.
00:11:37.080 | Then you hit the mispredict, you backed it up,
00:11:40.220 | but you remembered all the results you already calculated.
00:11:43.300 | Some of those are just fine.
00:11:45.460 | Like if you read a book and you misunderstand a paragraph,
00:11:48.060 | your understanding of the next paragraph
00:11:50.300 | is sometimes invariant to that misunderstanding.
00:11:53.540 | Sometimes it depends on it.
00:11:55.380 | - And you can kind of anticipate that invariance.
00:12:01.060 | - Yeah, well, you can keep track of whether
00:12:03.740 | the data changed and so when you come back
00:12:06.100 | to a piece of code, should you calculate it again
00:12:08.180 | or do the same thing?
00:12:09.660 | - Okay, how much of this is art
00:12:11.140 | and how much of it is science?
00:12:13.420 | 'Cause it sounds pretty complicated.
00:12:16.900 | - Well, how do you describe a situation?
00:12:18.420 | So imagine you come to a point in the road
00:12:20.380 | where you have to make a decision.
00:12:22.940 | And you have a bunch of knowledge about which way to go.
00:12:24.860 | Maybe you have a map.
00:12:26.700 | So you wanna go the shortest way
00:12:29.340 | or do you wanna go the fastest way
00:12:30.940 | or you wanna take the nicest road.
00:12:32.580 | So there's some set of data.
00:12:35.620 | So imagine you're doing something complicated
00:12:37.420 | like building a computer.
00:12:39.580 | And there's hundreds of decision points
00:12:42.100 | all with hundreds of possible ways to go.
00:12:45.500 | And the ways you pick interact in a complicated way.
00:12:51.220 | And then you have to pick the right spot.
00:12:53.460 | - Right, so that's-- - Or is that art or science?
00:12:54.700 | I don't know.
00:12:55.540 | - You avoided the question.
00:12:56.700 | You just described the Robert Frost problem
00:12:59.140 | of road less taken.
00:13:00.380 | - Described the Robert Frost problem?
00:13:03.500 | (laughing)
00:13:05.380 | That's what we do as computer designers.
00:13:07.220 | It's all poetry.
00:13:08.180 | - Okay. - Great.
00:13:09.220 | Yeah, I don't know how to describe that
00:13:11.980 | because some people are very good
00:13:14.180 | at making those intuitive leaps.
00:13:15.700 | It seems like this combination of things.
00:13:18.340 | Some people are less good at it
00:13:19.980 | but they're really good at evaluating the alternatives.
00:13:23.380 | Right, and everybody has a different way to do it.
00:13:27.060 | And some people can't make those leaps
00:13:29.660 | but they're really good at analyzing it.
00:13:32.100 | So when you see computers are designed by teams of people
00:13:34.660 | who have very different skill sets
00:13:37.060 | and a good team has lots of different kinds of people.
00:13:42.060 | I suspect you would describe some of them as artistic.
00:13:44.980 | - Right. - But not very many.
00:13:48.220 | - Unfortunately or fortunately.
00:13:49.860 | - Unfortunately.
00:13:51.500 | Well, you know, computer design's hard.
00:13:54.260 | It's 99% perspiration.
00:13:57.260 | The 1% inspiration is really important.
00:14:01.100 | - But you still need the 99.
00:14:03.660 | - Yeah, you gotta do a lot of work.
00:14:05.140 | And then there are interesting things to do
00:14:08.580 | at every level of that stack.
00:14:10.580 | - At the end of the day,
00:14:13.500 | if you run the same program multiple times,
00:14:16.680 | does it always produce the same result?
00:14:19.260 | Is there some room for fuzziness there?
00:14:22.540 | - That's a math problem.
00:14:24.540 | So if you run a correct C program,
00:14:26.380 | the definition is every time you run it,
00:14:29.280 | you get the same answer.
00:14:30.260 | - Yeah, well, that's a math statement.
00:14:32.300 | - Well, that's a language definitional statement.
00:14:35.260 | So for years when we first did 3D acceleration of graphics,
00:14:40.260 | you could run the same scene multiple times
00:14:45.100 | and get different answers.
00:14:46.580 | - Right. - Right.
00:14:48.020 | And then some people thought that was okay
00:14:50.140 | and some people thought it was a bad idea.
00:14:52.340 | And then when the HPC world used GPUs for calculations,
00:14:57.060 | they thought it was a really bad idea, okay?
00:14:59.940 | Now, in modern AI stuff,
00:15:02.220 | people are looking at networks
00:15:05.940 | where the precision of the data is low enough
00:15:08.860 | that the data is somewhat noisy.
00:15:11.460 | And the observation is the input data is unbelievably noisy.
00:15:15.080 | So why should the calculation be not noisy?
00:15:18.020 | And people have experimented with algorithms
00:15:20.000 | that say you can get faster answers by being noisy.
00:15:23.740 | Like as a network starts to converge,
00:15:26.060 | if you look at the computation graph,
00:15:27.380 | it starts out really wide and then it gets narrower.
00:15:29.940 | And you can say, is that last little bit that important?
00:15:32.260 | Or should I start the graph on the next rev
00:15:35.500 | before we whittle it all the way down to the answer?
00:15:38.560 | Right, so you can create algorithms that are noisy.
00:15:41.840 | Now, if you're developing something
00:15:43.240 | and every time you run it, you get a different answer,
00:15:45.240 | it's really annoying.
00:15:47.120 | And so most people think even today,
00:15:51.740 | every time you run the program, you get the same answer.
00:15:54.540 | - Now I know, but the question is,
00:15:56.160 | that's the formal definition of a programming language.
00:16:00.200 | There are definitions of languages
00:16:02.320 | that don't guarantee the same answer, but people who use those--
00:16:05.180 | You always want something 'cause you get a bad answer
00:16:08.560 | and then you're wondering, is it because
00:16:11.040 | of something in the algorithm or because of this?
00:16:13.160 | And so everybody wants a little switch that says,
00:16:14.920 | no matter what, do it deterministically.
00:16:18.080 | And it's really weird 'cause almost everything
00:16:20.200 | going into modern calculations is noisy.
00:16:23.120 | So why do the answers have to be so clear?
00:16:26.040 | - Right, so where do you stand?
00:16:27.400 | - I design computers for people who run programs.
00:16:30.280 | So if somebody says, I want a deterministic answer,
00:16:34.680 | like most people want that.
00:16:36.200 | - Can you deliver a deterministic answer,
00:16:38.000 | I guess is the question.
00:16:39.240 | Like when you--
00:16:40.080 | - Yeah, hopefully, sure.
00:16:41.880 | What people don't realize is you get a deterministic answer
00:16:45.080 | even though the execution flow is very nondeterministic.
00:16:48.880 | So you run this program 100 times,
00:16:50.880 | it never runs the same way twice, ever.
00:16:53.880 | - And the answer, it arrives at the same answer.
00:16:55.760 | - But it gets the same answer every time.
00:16:57.000 | - It's just amazing.