What I wanted to talk about today is what every AI engineer needs to know about GPUs. Over the last couple of years, most of what AI engineers have built as AI applications has been built on top of model APIs.
They use the OpenAI API, the Anthropic API, the DeepSeek API, and they build an application on top of that. That goes back to the original diagram that swyx put out, the Rise of the AI Engineer.
And having that API boundary is pretty important. You can't really build a complex system if everybody has to know how every piece works in detail, with no boundaries or breakdowns; you'll collapse under the complexity.
So I started off by trying to answer the question of why every AI engineer needs to know about GPUs. Here's the famous diagram: the AI engineer sits on the right of the API boundary, constrained by the needs of users rather than by what research makes possible or what infrastructure can provide.
The way I think about this distinction is that it's similar to how very few developers need to actually write a database. Almost no one writes a database except in an undergrad class, and relatively few developers even run one.
Most use a fully managed or hosted service like RDS on Amazon. But almost all developers, despite not being database engineers, are users of databases, and they need to know how to write good queries.
They need to know how to hold the tool. There's a famous educational resource about databases that I really love called Use the Index, Luke. It's basically about how to write SQL queries well, and the whole point is that there is a thing called an index.
There are a couple of data structures that support it; it talks about things like B-trees and log-structured merge trees. The intent isn't that you leave able to invert a binary tree on a whiteboard to land a FAANG job. The point is to teach you just enough that you can write queries that actually use the index instead of missing it.
Primary and secondary indices, all of that. Knowing it well enough to use it is an easier prospect than knowing it well enough to build it or innovate on it. And I think we're reaching that point with language models: you'll have more ability to integrate tightly and run your own models, and so more need to, so to speak, use the index.
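As a minimal sketch of the "use the index" idea (not from the talk; it assumes nothing beyond Python's built-in sqlite3), here's the same query going from a full table scan to an index lookup:

```python
# Minimal illustration of the "Use the Index, Luke" idea: the same query's
# plan changes from a full table scan to an index lookup once an index exists.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(10_000)],
)

query = "SELECT id FROM users WHERE email = ?"

# Without an index on email: SQLite reports a full scan of the table.
print(conn.execute(f"EXPLAIN QUERY PLAN {query}", ("user42@example.com",)).fetchall())

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index: the plan switches to a search using idx_users_email.
print(conn.execute(f"EXPLAIN QUERY PLAN {query}", ("user42@example.com",)).fetchall())
```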
If you want the one-sentence summary of this talk, it's: use the tensor cores, Luke. There's basically one part of an NVIDIA GPU, with an equivalent in other GPUs, that is fast and good and keeps getting better, and it's the tensor core.
It does matrix-matrix multiplication, and you should make sure you're using it rather than not using it, just like an index on a database. I made this point earlier: open-weights models, and the open-source software to run them, like Dynamo, are getting better very quickly.
So it finally makes sense to self-host. I'm not going to belabor this point because I'm giving another talk at 12:45, presenting benchmarking results on running vLLM, SGLang, and TensorRT-LLM across ten or twelve different models and ten or twelve different workloads, to show what's economical and what's not.
OK, so that's the why: a slight adjustment in what I think AI engineers should focus on and know about. Now, what is it that you, as an engineer, need to know about this hardware in detail? The primary thing is that GPUs embrace high bandwidth, not low latency.
That's the key feature of this hardware. The same is true of TPUs, but it distinguishes GPUs from pretty much every other piece of hardware you're used to programming. More specifically, they optimize for math bandwidth over memory bandwidth: computing on things is where they have the highest throughput.
So you want to align yourself not with latency but with throughput. Within throughput, focus on computational operations. And within computational operations, if you want to actually use the whole GPU you paid for, focus on low-precision matrix-matrix multiplications.
That wasn't a stutter: matrix-matrix multiplications, not just matrix-vector. OK, so the first point, latency versus bandwidth. I regret to inform you that the steady reduction of latency in computing systems died during the Bush administration, and it's not coming back.
See a talk later today for an alternative perspective. But GPUs embrace bandwidth scaling, so a little more detail on that. This is a piece of a computer, in case you haven't looked inside one in a while: a logic gate from the Zuse Z1, built in Germany in the 1930s, arguably the first digital computer.
Digital, but not electronic: it's mechanical. There are all these actuator plates that implement logical operations. What you see on the left is the logical operation AND: only if both input plates are pushed down will the driving plate, when it pushes forward, push the final plate forward.
That's AND. And the thing that pushes forward is driven by a clock, a literal clock. I guess now everybody has an Apple Watch, but there was a time when you'd have a physical clock for that sort of thing. The clock drives the system and causes it to compute its logical operations.
Every time the clock ticks, you get a new operation. We've changed computers a little in that we use different physics to drive them, but it's still the same basic abstract system: a motive force that arrives on a clock cycle and leads to calculations.
The cool thing is that if you just make that clock faster, literally nobody has to think about anything and the computer gets better. This was the primary driver of computers getting better in the '90s. No recompiling, no rewriting your software; everything just got better because the clock started going twice as fast.
Time is virtual inside a computer, so the program couldn't possibly know the difference. That was great during the mid-to-late '90s, and then it fell off a cliff in the early 2000s. That has shaped a lot of computing over the last two decades.
Its effects are still being felt: the loss of being able to avoid thinking about performance is slowly and inevitably changing pretty much everything in software, all the things you've seen around concurrency, GIL-free Python, multiprocessing, async coroutines.
There are a couple of details to dive into here, though I want to make sure I leave enough time for the GPU stuff. There are two notions of how to make things faster without a faster clock. One is parallelism: when you have a clock cycle, do two things instead of one.
Sounds like a good idea. The other is concurrency, which is a little trickier: you start a calculation, and maybe it takes five clock cycles to finish. Instead of waiting for it, try to do five other things with the clock cycles in between.
It makes your programs ugly, because you have to write async/await everywhere, and if you're writing Rust it's a world of Pin, but it keeps these super-high-bandwidth pipelines busy. Concurrency and parallelism are two strategies for maximizing bandwidth that GPUs adopt from the hardware level all the way up to the programming level, to push bandwidth further than CPUs can.
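A minimal sketch of that concurrency idea in Python, with asyncio.sleep standing in for a long-latency operation:

```python
# Instead of waiting for one long-latency operation to finish, kick off
# several and overlap the waits. asyncio.sleep stands in for a slow load.
import asyncio
import time


async def slow_operation(i: int) -> int:
    await asyncio.sleep(1.0)  # pretend this is a high-latency load
    return i


async def sequential() -> None:
    start = time.perf_counter()
    results = [await slow_operation(i) for i in range(5)]
    print(f"sequential: {results} in {time.perf_counter() - start:.1f}s")  # ~5s


async def concurrent() -> None:
    start = time.perf_counter()
    results = await asyncio.gather(*(slow_operation(i) for i in range(5)))
    print(f"concurrent: {results} in {time.perf_counter() - start:.1f}s")  # ~1s


asyncio.run(sequential())
asyncio.run(concurrent())
```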
GPUs take parallelism further than CPUs. Here I'm comparing an AMD EPYC CPU and an NVIDIA H100 SXM GPU. The figure of merit is the number of parallel threads and the wattage at which they operate. An AMD EPYC CPU can run two threads per core at about one watt per thread.
That's not bad. But an H100 can run over 16,000 parallel threads at about five centiwatts per thread, which is a very big difference. And parallel means that literally every clock cycle, all 16,000-plus threads of execution make progress at the same time. What about concurrency? It may look like CPUs have the advantage here, because concurrent threads are effectively unbounded.
You can just make a thread in Linux; it's free, the government doesn't want you to know this. On an H100 there's a limit, so it looks like: oh wow, only around 250,000 threads, what am I supposed to do with that? But the difference is context-switching speed.
How quickly can you go from executing one thing to another? If our goal is to take advantage of every clock cycle and it takes about 1,000 clock cycles, roughly a microsecond, to context switch, then our concurrency is actually pretty tightly bounded, because we do nothing useful for those 1,000 cycles.
But in GPUs, context switching happens literally every clock cycle, down in the warp scheduler inside the hardware. If you have to think about it that hard, you're probably having a bad time; normally it's just making everything run faster.
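A rough back-of-envelope in Python using those figures; the 128-core EPYC, the TDP numbers, and the 1,000-cycle context switch are assumptions for illustration:

```python
# Back-of-envelope numbers for the comparison above (assumed spec figures).
cpu_threads = 128 * 2            # assumed 128-core EPYC, 2 hardware threads per core
cpu_tdp_watts = 280              # rough TDP for a big EPYC part (assumption)
gpu_threads = 132 * 128          # H100 SXM: 132 SMs x 128 CUDA cores = 16,896 lanes
gpu_tdp_watts = 700              # H100 SXM board power

print(f"CPU: {cpu_threads} parallel threads, ~{cpu_tdp_watts / cpu_threads:.2f} W/thread")
print(f"GPU: {gpu_threads} parallel threads, ~{gpu_tdp_watts / gpu_threads:.3f} W/thread")

# If switching contexts costs ~1,000 cycles, a stall has to be much longer
# than that before switching away is worth it; the GPU's warp scheduler
# switches every cycle, so even short stalls can be hidden.
cpu_switch_cycles, gpu_switch_cycles = 1_000, 1
for stall_cycles in (100, 1_000, 100_000):
    print(
        f"stall of {stall_cycles:>7} cycles: switching pays off on "
        f"CPU: {stall_cycles > cpu_switch_cycles}, GPU: {stall_cycles > gpu_switch_cycles}"
    )
```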
There's not really a name for the phenomenon driving all of this work, but David Patterson, who came up with RISC machines and worked on TPUs, wrote it down, so I call it Patterson's law: latency lags bandwidth. Why are we doing all these things, rewriting and rethinking our programs, to take advantage of increasing bandwidth, with bandwidth replacing latency scaling?
It's because if you look across a variety of computer subsystems, networks, memory, disks, the bandwidth improvement over time is roughly the square of the latency improvement. This is one of those Moore's-Law-style charts looking at performance trends over time.
For every 10x we improve latency, we get a 100x improvement in bandwidth. There are some arguments in the article about where this comes from. Basically, with latency you run into the laws of physics; with bandwidth you just run into how many things you can do at the same time.
And you can always take the same physics and spread it out more easily than you can come up with new physics to exploit. You can't change the laws of physics, as Scotty says in Star Trek, and that's one of the limits on network latency: we already send packets at about 70% of the speed of light, so we can't make them 10x faster.
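As a toy illustration of that square relationship (illustrative numbers only, nothing measured):

```python
# "Latency lags bandwidth" as a rule of thumb: over the same stretch of time,
# bandwidth improves roughly as the square of the latency improvement.
for latency_improvement in (2, 5, 10):
    bandwidth_improvement = latency_improvement ** 2
    print(
        f"{latency_improvement:>2}x lower latency  ->  "
        f"~{bandwidth_improvement:>3}x more bandwidth"
    )
```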
All right, so that's bandwidth; GPUs embrace it. Maybe the big takeaway from Patterson's law is that bandwidth has won out over and over again, so bet on the bandwidth-oriented hardware. I don't know if the person who's going to be talking about LPUs or Etched is here, but we should fight about this later.
All right, but what kind of bandwidth? Arithmetic bandwidth over memory bandwidth. Not moving bytes around; they do have high-bandwidth memory, the fanciest, finest high-bandwidth memory, but where they really excel is doing calculations on that memory. The takeaway is that N-squared algorithms are usually bad,
but if it's N-squared operations for N memory loads, it actually works out pretty nicely. It's almost as if Bill Dally and others were thinking of this when they built the chip. Arithmetic intensity, or math intensity, is the term for this. If you look at the spec sheet, the numbers highlighted in purple, the ones denominated in tera, run up into the thousands, while the memory bandwidth at the bottom is much smaller.
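A rough roofline sketch of what that ratio implies, using approximate H100 SXM spec-sheet numbers (assumptions, not measurements):

```python
# Approximate H100 SXM figures: ~3.35 TB/s of HBM3 bandwidth and
# ~989 TFLOPS of dense BF16 tensor-core math.
mem_bw_bytes_per_s = 3.35e12
math_flops_per_s = 989e12

# Ridge point: how many FLOPs you must do per byte loaded to keep the
# tensor cores busy instead of waiting on memory.
ridge = math_flops_per_s / mem_bw_bytes_per_s
print(f"need ~{ridge:.0f} FLOPs per byte moved to be compute-bound")

# A square matmul of size N does ~2*N^3 FLOPs over ~3*N^2 values moved,
# so its arithmetic intensity grows linearly with N (assuming 2-byte values).
for n in (128, 1024, 8192):
    intensity = (2 * n**3) / (3 * n**2 * 2)
    print(f"N={n:>5}: ~{intensity:.0f} FLOPs/byte")
```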
That ratio has not changed with Blackwell; it's only gotten worse, or better, depending on how you look at it: the ratio has gone up. So LLM inference works pretty nicely during prompt processing. For an 8-billion-parameter model with FP8 quantization, you move 8 gigabytes from memory into the registers for calculation.
You do about 60 billion floating-point operations, and that doesn't scale much with the sequence length directly. But then, when you go to decode, you need to move those 8 billion parameters again, every token, from the GPU's memory into the place where the compute happens.
It's a von Neumann architecture; you can't compute on the data in place, you have to move it. So LLM inference works great during prompt processing, not so much during decoding. One way to get around this is to just do more stuff while you're decoding.
One example is to take a small model, say 8 billion parameters, and run it something like 1,000 times on the same prompt. You load the weights once, but you generate thousands of outputs. So there's an inherent advantage to small models here: they're more sympathetic to the hardware.
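A back-of-envelope sketch of that weight-reuse effect, assuming an 8-billion-parameter FP8 model and ignoring KV-cache traffic:

```python
# Every decode step streams all ~8 GB of FP8 weights, but each sequence only
# does ~2 FLOPs per parameter per token. Batching more sequences (or more
# samples of the same prompt) reuses that one weight load.
params = 8e9
bytes_per_param = 1            # FP8
weight_bytes = params * bytes_per_param
flops_per_token_per_seq = 2 * params

for batch in (1, 8, 64, 512):
    intensity = batch * flops_per_token_per_seq / weight_bytes
    print(f"batch {batch:>3}: ~{intensity:.0f} FLOPs per weight byte moved")
```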
You can actually match quality if you do things right and you have a good verifier; in this case, does the output pass a Python test? That lets you pick one of your many outcomes. So you can use Llama 3.1 8B to match GPT-4o with about 100 generations.
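A minimal sketch of that best-of-N-with-a-verifier idea; `generate` and `passes_tests` here are hypothetical stand-ins, not anything from the talk:

```python
# Sample many completions from a small model and keep one that passes a test.
# `generate` is a stand-in for your inference client (vLLM, SGLang, an HTTP API, ...).
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],   # returns a candidate Python snippet
    verifier: Callable[[str], bool],  # e.g. "does it pass the unit tests?"
    n: int = 100,
) -> str | None:
    for _ in range(n):
        candidate = generate(prompt)
        if verifier(candidate):
            return candidate
    return None


def passes_tests(code: str) -> bool:
    # Toy verifier: the candidate must define add(a, b) that passes a check.
    namespace: dict = {}
    try:
        exec(code, namespace)  # don't run untrusted code outside a sandbox
        return namespace["add"](2, 3) == 5
    except Exception:
        return False
```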
The figure on the left is a reproduction: I read a research paper, sat down, spent a day coding, and got the same result on different data with a different model, which is more than you can say for a lot of research. So this is legit; this is real science.
It's a real phenomenon, and it fits with the hardware. So, lastly: we want to do throughput-oriented, large-scale work; we want to do it with computation and mathematics, not memory movement; and the specific thing we want to do is low-precision matrix multiplication.
The takeaway here is that some surprising things turn out to be approximately free. I don't have time to go into the details, but it turns out Kyle Kranen, also on the Dynamo team, came to the exact same conclusion; we were comparing notes last night.
Check out his talk in the afternoon of the infrastructure track if you want less hand-waving and more charts. Things like multi-token prediction and multiple samples per query suddenly become basically free. The reason is that the latest GPUs, NVIDIA's and others', have this giant chunk in them, the tensor core, specialized for low-precision matrix-matrix multiplication.
All the things in purple here with the really big numbers are tensor core throughput, not CUDA core throughput. Tensor cores do exactly one thing: floating-point matrix multiplication. It's a bit of a tough world to live in as a programmer to discover there's only one data type you're really allowed to work with.
But you just get more creative; you can do a Fourier transform with this thing if you want. The generation phase of language models is very heavy on matrix-vector operations if you write it out naively. So the things that are basically free are the things that upgrade you to a matrix-matrix operation.
There are some microbenchmarks from the ThunderKittens people at Hazy Research, I think, that basically exercise a tensor core. If you give it a matrix and then a mostly empty matrix with only one column filled, you get about one over N of the peak performance.
But if you just add more stuff there, all of a sudden the performance scales up to match. This is another phenomenon that pushes you in the direction of generating multiple samples or multiple tokens, as DeepSeek does with multi-token prediction, and I think the Llama 4 models do as well.
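A rough way to see that effect yourself, sketched in PyTorch (an illustration inspired by the description above, not a reproduction of the ThunderKittens benchmarks):

```python
# Time A @ X for a weight matrix A and inputs X with 1..k columns. On
# tensor-core hardware, small k tends to cost about the same as k=1,
# because k=1 wastes most of the tile the hardware computes anyway.
import torch

assert torch.cuda.is_available()
n = 8192
A = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

for k in (1, 8, 64, 256):
    X = torch.randn(n, k, device="cuda", dtype=torch.bfloat16)
    for _ in range(3):                       # warmup
        A @ X
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        A @ X
    end.record()
    torch.cuda.synchronize()
    print(f"k={k:>3}: {start.elapsed_time(end) / 20:.3f} ms per matmul")
```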
So as an AI engineer, the thing to look at is: maybe I can get away with running a smaller model that fits on a GPU under my desk, and then scale it out to get sufficient quality to satisfy users.
There's a bunch of research on this from around the release of ChatGPT, when there was still a thriving academic field on top of language models. People have kind of forgotten about it, but I think the open models are now good enough that this is back to being a good idea.
Cool. I think I only have about ten seconds left, so I'll just say: if you want to learn more, I wrote this GPU glossary, modal.com/gpu-glossary. It's a "CUDA docs for humans" attempt to explain the whole software and hardware stack in one place, with lots of links, so that when you're reading about a warp scheduler and you've forgotten what a streaming multiprocessor architecture is and how that relates to the NVIDIA CUDA compiler driver, all of those things are one click away.
And if you want to run on this hardware, there's no better place than the platform I work on, Modal: serverless GPUs and more. We ripped out and rewrote the whole container stack to make serverless Python work well for data-intensive and compute-intensive workloads like language model inference.
So definitely check it out. Come find us at the expo hall and we'll talk your ear off about it. All right, thank you very much.