
Jim Keller: Abstraction Layers from the Atom to the Data Center | AI Podcast Clips


Transcript

- So let's get into the basics before we zoom back out. How do you build a computer from scratch? What is a microprocessor? What is a microarchitecture? What's an instruction set architecture? Maybe even as far back as what is a transistor? - So the special charm of computer engineering is there's a relatively good understanding of abstraction layers.

So down at the bottom you have atoms, and atoms get put together in materials like silicon or doped silicon or metal, and we build transistors. On top of that, we build logic gates, right? And then functional units, like an adder, a subtractor, an instruction parsing unit, and then we assemble those into processing elements.
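
To make that gates-to-functional-units step concrete, here is a minimal sketch in C of a one-bit full adder composed from AND/OR/XOR gates; the function names are illustrative, not anything from the conversation:

```c
#include <stdio.h>

/* One abstraction layer up from transistors: logic gates modeled as
   bitwise operations, then a functional unit (a 1-bit full adder)
   built out of those gates. */
static unsigned g_and(unsigned a, unsigned b) { return a & b; }
static unsigned g_or (unsigned a, unsigned b) { return a | b; }
static unsigned g_xor(unsigned a, unsigned b) { return a ^ b; }

/* Full adder: sum = a XOR b XOR cin, carry-out = majority(a, b, cin). */
static void full_adder(unsigned a, unsigned b, unsigned cin,
                       unsigned *sum, unsigned *cout) {
    *sum  = g_xor(g_xor(a, b), cin);
    *cout = g_or(g_and(a, b), g_and(cin, g_xor(a, b)));
}

int main(void) {
    unsigned sum, cout;
    full_adder(1, 1, 0, &sum, &cout);   /* 1 + 1 = binary 10 */
    printf("sum=%u carry=%u\n", sum, cout);
    return 0;
}
```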

Modern computers are built out of probably 10 to 20 locally organized, coherent processing elements, and then that runs computer programs, right? So there's abstraction layers, and then in software there's an instruction set you run, and then there's assembly language, C, C++, Java, JavaScript. There's abstraction layers, essentially, from the atom to the data center, right?

So when you build a computer, first there's a target, like what's it for? Like how fast does it have to be? Which today there's a whole bunch of metrics about what that is. And then in an organization of 1,000 people who build a computer, there's lots of different disciplines that you have to operate on.

Does that make sense? And so-- - So there's a bunch of levels of abstraction. In an organization like Intel, and in your own vision, there's a lot of brilliance that comes in at every one of those layers. Some of it is science, some of it is engineering, some of it is art.

What's the most, if you could pick favorites, what's the most important, your favorite layer on these layers of abstractions? Where does the magic enter this hierarchy? - I don't really care. That's the fun, you know, I'm somewhat agnostic to that. So I would say for relatively long periods of time, instruction sets are stable.

So the x86 instruction set, the ARM instruction set. - What's an instruction set? - So it says, how do you encode the basic operations? Load, store, multiply, add, subtract, conditional branch. There aren't that many interesting instructions. Like if you look at a program and it runs, 90% of the execution is on 25 opcodes, 25 instructions.
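
As a rough illustration of what "encoding the basic operations" means, here is a sketch using a made-up fixed-width format; the field layout and opcode names are hypothetical, not x86 or ARM:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 32-bit instruction word, purely for illustration:
   bits 31-26 opcode, 25-21 dest reg, 20-16 src reg 1, 15-11 src reg 2. */
enum opcode { OP_LOAD = 0, OP_STORE, OP_ADD, OP_SUB, OP_MUL, OP_BRANCH };

static void decode(uint32_t insn) {
    unsigned op  = (insn >> 26) & 0x3F;
    unsigned rd  = (insn >> 21) & 0x1F;
    unsigned rs1 = (insn >> 16) & 0x1F;
    unsigned rs2 = (insn >> 11) & 0x1F;
    printf("op=%u rd=r%u rs1=r%u rs2=r%u\n", op, rd, rs1, rs2);
}

int main(void) {
    /* ADD r3, r1, r2 in this made-up encoding. */
    uint32_t insn = ((uint32_t)OP_ADD << 26) | (3u << 21)
                  | (1u << 16) | (2u << 11);
    decode(insn);
    return 0;
}
```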

And those are stable, right? - What does it mean, stable? - Intel architecture has been around for 25 years. - It works. - It works. And that's because the basics, you know, were defined a long time ago, right? Now, the way an old computer ran is you fetched instructions and you executed them in order.

Do the load, do the add, do the compare. The way a modern computer works is you fetch large numbers of instructions, say 500. And then you find the dependency graph between the instructions. And then you execute, in independent units, those little micro-graphs. So, a modern computer... like, people like to say computers should be simple and clean.

But it turns out the market for simple, clean, slow computers is zero, right? We don't sell any simple, clean computers. Now, how you build it can be clean, but the computer people want to buy, say in a phone or a data center, fetches a large number of instructions, computes the dependency graph, and then executes it in a way that gets the right answers.

- And optimize that graph somehow. - Yeah, they run deeply out of order. And then there's semantics around how memory ordering works and other things work. So the computer sort of has a bunch of bookkeeping tables that says what order do these operations finish in or appear to finish in.
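
A minimal sketch of that bookkeeping, assuming a toy three-register instruction format: each instruction depends on the most recent earlier writer of each register it reads, which is the graph an out-of-order core tracks, at far larger scale and with register renaming and memory ordering layered on top:

```c
#include <stdio.h>

/* Toy instruction: writes one register, reads up to two. -1 = unused. */
struct insn { int dst, src1, src2; };

#define NREGS  32
#define WINDOW 6

int main(void) {
    /* A small "fetched window"; a real core handles hundreds. */
    struct insn win[WINDOW] = {
        {1,  2,  3},   /* i0: r1 = r2 op r3 */
        {4,  1,  3},   /* i1: reads r1 -> depends on i0 */
        {5,  2,  2},   /* i2: independent */
        {1,  5, -1},   /* i3: reads r5 -> depends on i2 */
        {6,  1,  4},   /* i4: depends on i3 (r1) and i1 (r4) */
        {7,  2, -1},   /* i5: independent */
    };
    int last_writer[NREGS];
    for (int r = 0; r < NREGS; r++) last_writer[r] = -1;

    for (int i = 0; i < WINDOW; i++) {
        int srcs[2] = { win[i].src1, win[i].src2 };
        printf("i%d depends on:", i);
        for (int s = 0; s < 2; s++)
            if (srcs[s] >= 0 && last_writer[srcs[s]] >= 0)
                printf(" i%d", last_writer[srcs[s]]);
        printf("\n");
        last_writer[win[i].dst] = i;  /* this insn is now the newest writer */
    }
    return 0;
}
```

Instructions with no dependence edge between them (i2 and i1, say) are exactly the ones that can issue to different execution units in the same cycle.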

But to go fast, you have to fetch a lot of instructions and find all the parallelism. Now there's a second kind of computer, which we call GPUs today. And here's the difference I call out: there's found parallelism, like you have a program with a lot of dependent instructions. You fetch a bunch and then you go figure out the dependency graph and you issue instructions out of order.

That's because you have one serial narrative to execute, which in fact can be done out of order. - You call it a narrative? - Yeah. - Wow. - So yeah, so humans think in serial narrative. So read a book, right? There's a sentence after sentence after sentence and there's paragraphs.

Now you could diagram that. Imagine you diagrammed it properly and you said, which sentences could be read in any order without changing the meaning, right? - That's a fascinating question to ask of a book. Yeah. - Yeah, you could do that, right? So some paragraphs could be reordered, some sentences can be reordered.

You could say, he is tall and smart and X, right? And it doesn't matter the order of tall and smart. But if you say the tall man is wearing a red shirt, what color, you know, like, you can create dependencies, right? And so GPUs, on the other hand, run simple programs on pixels, but you're given a million of them.

And to first order, the screen you're looking at doesn't care which order you do it in. So I call that given parallelism. Simple narratives over large numbers of things, where you can just say it's parallel because you told me it was. - So found parallelism where the narrative is sequential but you discover like little pockets of parallelism versus-- - Turns out large pockets of parallelism.
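
A tiny sketch of given parallelism, with a made-up shade() standing in for a pixel shader: no iteration touches another iteration's output, so any execution order, or any parallel schedule across GPU lanes, produces the same image:

```c
#include <stdint.h>

#define W 1920
#define H 1080

/* Stand-in for a pixel shader: computes a pixel from its coordinates
   alone, reading nothing another pixel writes. */
static uint32_t shade(int x, int y) {
    return (uint32_t)((x ^ y) & 0xFF) * 0x010101u;  /* toy pattern */
}

static uint32_t frame[W * H];

int main(void) {
    /* This loop nest could be split across thousands of GPU lanes;
       the parallelism is given, not discovered. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            frame[y * W + x] = shade(x, y);
    return 0;
}
```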

- Large, so how hard is it to discover? - Well, how hard is it? That's just transistor count, right? So once you crack the problem, you say, here's how you fetch 10 instructions at a time, here's how you calculate the dependencies between them, here's how you describe the dependencies, here's, you know, these are pieces, right?

- So once you describe the dependencies, then it's just a graph. Sort of, it's an algorithm that finds... what is that? I'm sure there's a graph-theoretical answer here that's solvable. In general, in modern programs that human beings write, how much found parallelism is there in them? - About 10x.

- What does 10x mean? - So if you execute it in order. - Versus, yeah. - You would get what's called cycles per instruction, and it would be about, you know, three cycles per instruction, because of the latency of the operations and stuff. And a modern computer executes it at like 0.25 cycles per instruction.
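
The 10x figure falls straight out of those two CPI numbers; a quick check:

```c
#include <stdio.h>

int main(void) {
    double in_order_cpi     = 3.0;   /* ~3 cycles per instruction, in order */
    double out_of_order_cpi = 0.25;  /* ~4 instructions per cycle, out of order */

    /* 3 / 0.25 = 12, i.e. roughly the "about 10x" found parallelism. */
    printf("speedup ~ %.0fx\n", in_order_cpi / out_of_order_cpi);
    return 0;
}
```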

So it's about... we today find 10x. And there's two things. One is the found parallelism in the narrative, right? And the other is the predictability of the narrative, right? So certain operations, they do a bunch of calculations, and if greater than one, do this, else do that. That decision is predicted in modern computers to high-90s percent accuracy.

So branches happen a lot. So imagine you have a decision to make every six instructions, which is about the average, right? But you want to fetch 500 instructions, figure out the graph, and execute them all in parallel. That means, let's say, if you fetch 600 instructions and there's a branch every six, you have to predict 99 out of a hundred branches correctly for that window to be effective.
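
Why 99 out of 100 matters: with roughly 500/6 ≈ 83 predictions in flight, and each independently correct with probability p, the whole window is on the right path with probability about p^83. A quick check (compile with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* ~500-instruction window, a branch about every 6 instructions. */
    double branches = 500.0 / 6.0;          /* ~83 predictions in flight */
    double acc[] = { 0.85, 0.92, 0.99 };    /* the accuracies quoted below */

    for (int i = 0; i < 3; i++)
        printf("per-branch accuracy %.2f -> whole window correct %.4f\n",
               acc[i], pow(acc[i], branches));
    /* ~0.0000 at 85%, ~0.001 at 92%, ~0.43 at 99%: only the last
       makes a 500-instruction window worth building. */
    return 0;
}
```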

- Okay, so parallelism, you can't parallelize branches or you can. - No, you can predict-- - What does predict a branch mean? What does predict-- - So imagine you do a computation over and over, you're in a loop. So while n is greater than one, do. And you go through that loop a million times.

So every time you look at the branch, you say, it's probably still greater than one. - And you're saying you could do that accurately. - Very accurately. Modern computers-- - My mind is blown. How the heck do you do that? Wait a minute. - Well, you want to know?

This is really sad. 20 years ago, you simply recorded which way the branch went last time and predicted the same thing. - Right. - Okay. - What's the accuracy of that? - 85%. So then somebody said, hey, let's keep a couple of bits and have a little counter. So when it predicts one way, we count up, and then it pins at the top.

So say you have a three-bit counter: you count up and then you count down, and you can use the top bit as the sign bit, so you have a signed two-bit number. So if it's greater than one, you predict taken, and less than one, you predict not taken, right?

Or less than zero, whatever the thing is. And that got us to 92%. - Oh. - Okay, now it gets better. This branch depends on how you got there. So if you came down the code one way, you're talking about Bob and Jane, right? And then said, does Bob like Jane?

It went one way, but if you're talking about Bob and Jill, does Bob like Jane? You go a different way, right? So that's called history. So you take the history and a counter. That's cool, but that's not how anything works today. They use something that looks a little like a neural network.
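A minimal C sketch of that history-plus-counter scheme, roughly a gshare-style predictor: a table of two-bit saturating counters indexed by the branch address XORed with recent outcome history. The table size and sample branch address are made up, and the neural-network-style predictors mentioned next replace this simple table with learned pattern matching:

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t  counters[TABLE_SIZE];  /* 2-bit counters, 0..3 */
static uint32_t history;               /* recent branch outcomes, 1 bit each */

static int predict(uint32_t pc) {
    uint32_t idx = (pc ^ history) & (TABLE_SIZE - 1);
    return counters[idx] >= 2;         /* 1 = predict taken */
}

static void train(uint32_t pc, int taken) {
    uint32_t idx = (pc ^ history) & (TABLE_SIZE - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;  /* pin at the top */
    if (!taken && counters[idx] > 0) counters[idx]--;  /* pin at the bottom */
    history = ((history << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
}

int main(void) {
    /* A loop branch taken 99 times, then not taken: after a short
       warm-up the predictor gets the loop essentially right. */
    int correct = 0;
    for (int i = 0; i < 100; i++) {
        int taken = (i < 99);
        correct += (predict(0x400123) == taken);
        train(0x400123, taken);
    }
    printf("correct: %d/100\n", correct);
    return 0;
}
```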

So in modern ones, you take all the execution flows and then you do basically deep pattern recognition of how the program is executing. And you do that multiple different ways, and you have something that chooses what the best result is. There's a little supercomputer inside the computer. - That's trying to predict branching.

- That calculates which way branches go. So the effective window that it's worth finding graphs in gets bigger. - Why was that gonna make me sad? 'Cause that's amazing. - It's amazingly complicated. - Oh, well. - Well, here's the funny thing. So to get to 85% took a thousand bits.

To get to 99% takes tens of megabits. So this is one of those things where, to get from a window of say 50 instructions to 500, it took three or four orders of magnitude more bits. - Now if you get the prediction of a branch wrong, what happens then?

- You flush the pipe. - You flush the pipe, so it's just a performance cost. - But it gets even better. So we're starting to look at stuff that says, so you executed down this path, and then you had two ways to go, but far, far away there's something that doesn't depend on which path you went.

So you took the wrong path, you executed a bunch of stuff. Then you hit the misprediction, you backed it up, but you remembered all the results you'd already calculated. Some of those are just fine. Like if you read a book and you misunderstand a paragraph, your understanding of the next paragraph--

Sometimes it's invariant to that misunderstanding. Sometimes it depends on it. - And you can kind of anticipate that invariance. - Yeah, well, you can keep track of whether the data changed, and so when you come back to a piece of code, should you calculate it again or do the same thing?
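
A minimal sketch of that "did the data change?" idea, with a hypothetical expensive() standing in for the recomputed work: cache the result keyed by its inputs, and on re-execution reuse it only if the inputs are unchanged:

```c
#include <stdio.h>

/* Cached result, keyed by the inputs that produced it. */
struct cached { int valid; int a, b; int result; };

static int expensive(int a, int b) { return a * a + b * b; /* stand-in */ }

static int compute(struct cached *c, int a, int b) {
    if (c->valid && c->a == a && c->b == b)
        return c->result;               /* inputs unchanged: reuse */
    c->valid = 1; c->a = a; c->b = b;   /* inputs changed: recalculate */
    c->result = expensive(a, b);
    return c->result;
}

int main(void) {
    struct cached c = {0};
    printf("%d\n", compute(&c, 3, 4));  /* computed */
    printf("%d\n", compute(&c, 3, 4));  /* reused, e.g. after a flush */
    printf("%d\n", compute(&c, 5, 6));  /* inputs changed: recomputed */
    return 0;
}
```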

- Okay, how much of this is art and how much of it is science? 'Cause it sounds pretty complicated. - Well, how do you describe a situation? So imagine you come to a point in the road where you have to make a decision. And you have a bunch of knowledge about which way to go.

Maybe you have a map. So you wanna go the shortest way or do you wanna go the fastest way or you wanna take the nicest road. So there's some set of data. So imagine you're doing something complicated like building a computer. And there's hundreds of decision points all with hundreds of possible ways to go.

And the ways you pick interact in a complicated way. And then you have to pick the right spot. - Right, so that's-- - Or is that art or science? I don't know. - You avoided the question. You just described the Robert Frost problem of the road less taken. - Described the Robert Frost problem?

(laughing) That's what we do as computer designers. It's all poetry. - Okay. - Great. Yeah, I don't know how to describe that because some people are very good at making those intuitive leaps. It seems like this combination of things. Some people are less good at it but they're really good at evaluating the alternatives.

Right, and everybody has a different way to do it. And some people can't make those leaps but they're really good at analyzing it. So, you know, computers are designed by teams of people who have very different skill sets, and a good team has lots of different kinds of people.

I suspect you would describe some of them as artistic. - Right. - But not very many. - Unfortunately or fortunately. - Unfortunately. Well, you know, computer design's hard. It's 99% perspiration. The 1% inspiration is really important. - But you still need the 99. - Yeah, you gotta do a lot of work.

And then there are interesting things to do at every level of that stack. - At the end of the day, if you run the same program multiple times, does it always produce the same result? Is there some room for fuzziness there? - That's a math problem. So if you run a correct C program, the definition is every time you run it, you get the same answer.

- Yeah, well, that's a math statement. - Well, that's a language definitional statement. So for years when we first did 3D acceleration of graphics, you could run the same scene multiple times and get different answers. - Right. - Right. And then some people thought that was okay and some people thought it was a bad idea.
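
A concrete reason the same scene could come out differently: floating-point addition isn't associative, so summing in a different order, as a parallel reduction does, can change the low bits of the result. A minimal sketch:

```c
#include <stdio.h>

int main(void) {
    float big  = 1e8f;   /* ulp of 1e8 in single precision is 8 */
    float sum1 = big;
    float sum2 = 0.0f;

    /* Order 1: add 100 ones to the big number; each one rounds away. */
    for (int i = 0; i < 100; i++) sum1 += 1.0f;

    /* Order 2: sum the ones first, then add the big number. */
    for (int i = 0; i < 100; i++) sum2 += 1.0f;
    sum2 += big;

    /* Under IEEE-754: 100000000.0 vs 100000096.0 */
    printf("%.1f vs %.1f\n", sum1, sum2);
    return 0;
}
```

A parallel machine that doesn't fix the reduction order can land on either answer from run to run, which is exactly what the HPC world objected to.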

And then when the HPC world used GPUs for calculations, they thought it was a really bad idea, okay? Now, in modern AI stuff, people are looking at networks where the precision of the data is low enough that the data is somewhat noisy. And the observation is the input data is unbelievably noisy.

So why should the calculation be not noisy? And people have experimented with algorithms that, say, can get faster answers by being noisy. Like, as a network starts to converge, if you look at the computation graph, it starts out really wide and then it gets narrower. And you can say, is that last little bit that important?

Or should I start the graph on the next rev before we whittle it all the way down to the answer? Right, so you can create algorithms that are noisy. Now, if you're developing something and every time you run it, you get a different answer, it's really annoying. And so most people think even today, every time you run the program, you get the same answer.
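
One flavor of that trade, sketched with Newton's method for a square root (the function and tolerances are illustrative, not anything specific from the conversation): stopping at a loose tolerance gives up the last little bit of the answer in exchange for finishing, and starting the next step, sooner. Compile with -lm:

```c
#include <math.h>
#include <stdio.h>

/* Newton's method for sqrt(x), stopped at a caller-chosen tolerance. */
static double approx_sqrt(double x, double tol, int *iters) {
    double g = x > 1.0 ? x / 2.0 : 1.0;
    *iters = 0;
    while (fabs(g * g - x) > tol * x) {
        g = 0.5 * (g + x / g);   /* one refinement step */
        (*iters)++;
    }
    return g;
}

int main(void) {
    int it_loose, it_tight;
    double loose = approx_sqrt(2.0, 1e-3,  &it_loose);  /* "good enough" */
    double tight = approx_sqrt(2.0, 1e-12, &it_tight);  /* whittled down */
    printf("loose: %.10f in %d iters\n", loose, it_loose);
    printf("tight: %.10f in %d iters\n", tight, it_tight);
    return 0;
}
```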

- No, I know, but the question is, is that the formal definition of a programming language? - There are definitions of languages that don't get the same answer, but people who use those... You always want something, 'cause you get a bad answer and then you're wondering, is it because of something in the algorithm or because of this?

And so everybody wants a little switch that says, no matter what, do it deterministically. And it's really weird 'cause almost everything going into modern calculations is noisy. So why do the answers have to be so clear? - Right, so where do you stand? - I design computers for people who run programs.

So if somebody says, I want a deterministic answer, like most people want that. - Can you deliver a deterministic answer, I guess is the question. Like when you-- - Yeah, hopefully, sure. What people don't realize is you get a deterministic answer even though the execution flow is very undeterministic.

So you run this program 100 times, it never runs the same way twice, ever. - And the answer, it arrives at the same answer. - But it gets the same answer every time. - It's just amazing. (upbeat music) (upbeat music) (upbeat music) (upbeat music) (upbeat music) (upbeat music)