
What every AI engineer needs to know about GPUs — Charles Frye, Modal



00:00:02.000 | - So, what I wanted to talk about today
00:00:18.440 | was what every AI engineer needs to know about GPUs.
00:00:23.100 | So far in the last couple of years,
00:00:28.680 | most of the things that people have built
00:00:30.340 | as AI applications, people who are AI engineers,
00:00:33.100 | they've been building on top of model APIs.
00:00:35.640 | So they use the OpenAI API, the Anthropic API,
00:00:37.980 | the DeepSeek API, and they build an application
00:00:40.760 | on top of that.
00:00:42.080 | And that goes back to kind of like the initial diagram
00:00:45.280 | that swyx put out, the rise of the AI engineer thing.
00:00:49.800 | And yeah, probably just mirror would be great.
00:00:54.600 | And having that API boundary is pretty important, right?
00:01:00.480 | It's like, you can't really build a complex system
00:01:03.140 | if everybody has to know how every piece works,
00:01:05.600 | and everybody has to know all of it in detail,
00:01:07.440 | and there's no boundaries or breakdowns.
00:01:09.940 | Like, you're just, yeah.
00:01:11.700 | You'll collapse in complexity if you do that.
00:01:15.040 | So, oh, that wasn't me, but I'm down with it.
00:01:21.600 | So, yeah, sorry.
00:01:24.540 | I started off by trying to answer the question
00:01:26.160 | of why every AI engineer needs to know about GPUs.
00:01:29.380 | And so, yeah, so here's our famous diagram.
00:01:32.160 | AI engineer on the right of the API boundary
00:01:34.960 | where they're like constrained by the needs of users
00:01:39.880 | rather than like what's possible with research
00:01:42.760 | or what infrastructure is capable of providing.
00:01:45.640 | And the way that I think about this distinction
00:01:49.940 | is that it's kind of similar to the way
00:01:52.680 | that very few developers need to actually, like,
00:01:55.760 | write a database.
00:01:57.800 | Like, almost no one writes a database
00:02:00.820 | except in their like, you know, like undergrad classes.
00:02:04.120 | And then very few developers even run a database.
00:02:06.940 | A lot of them will use either a fully managed service
00:02:10.200 | or just like a hosted service like RDS on Amazon.
00:02:15.200 | But like almost all developers,
00:02:18.020 | despite the fact that they aren't like database engineers,
00:02:22.560 | they are users of databases
00:02:24.100 | and they need to know
00:02:26.000 | how to write good queries.
00:02:27.480 | They need to know how to hold the tool
00:02:29.700 | in order to press the buttons on the side.
00:02:31.620 | So there's a famous educational resource
00:02:34.320 | that I really love about databases
00:02:37.200 | called Use the Index, Luke.
00:02:38.540 | That's basically about how to write SQL queries
00:02:42.700 | and not suck.
00:02:44.380 | And the whole point is like there is a thing called an index.
00:02:48.920 | There's a couple of data structures that support it.
00:02:50.840 | It talks about things like B-trees
00:02:53.280 | and log-structured merge trees and stuff.
00:02:56.000 | And the intent of it isn't that you can then leave
00:02:58.000 | and like invert a binary tree on a whiteboard
00:03:00.500 | so you can get a FANG job.
00:03:01.580 | Like the point of it is to teach you what you need to know
00:03:04.340 | so that you can write queries properly,
00:03:06.560 | queries that use the index rather than bypass it.
00:03:09.340 | Primary and secondary indices, all these things.
00:03:11.340 | And, you know,
00:03:12.380 | that's a bit of an easier prospect,
00:03:17.140 | knowing it well enough to be able to use it
00:03:19.080 | rather than build it or innovate on it.
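As a minimal sketch of that "use the index" idea, here's how the same query goes from a full table scan to an index lookup, using Python's built-in sqlite3; the table and data here are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.executemany("INSERT INTO users (email) VALUES (?)",
                [("a@example.com",), ("b@example.com",)])

# Without an index on email, SQLite reports a full table scan.
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
                  ("a@example.com",)).fetchall())

# Add a secondary index; the same query becomes an index search.
con.execute("CREATE INDEX idx_users_email ON users(email)")
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
                  ("a@example.com",)).fetchall())
```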
00:03:21.800 | And I think we're reaching this point now with language models
00:03:25.460 | where you'll have more ability to like integrate tightly,
00:03:30.980 | run your own language models
00:03:32.140 | and so more need to like use the index.
00:03:34.880 | Or I guess if you want
00:03:36.880 | the one-sentence summary of this talk,
00:03:39.420 | it's: use the tensor cores, Luke.
00:03:42.060 | So in building your applications,
00:03:44.200 | there's basically one part of an NVIDIA GPU
00:03:47.180 | and an equivalent in other GPUs
00:03:49.140 | that is fast and good and gets better.
00:03:52.280 | And it's the tensor core.
00:03:53.680 | And it does matrix-matrix multiplication.
00:03:56.140 | And you should make sure you're using it
00:03:58.600 | and not leaving it idle, just like an index on a database.
00:04:07.660 | So yeah, I kind of made this point earlier
00:04:10.040 | about open-weights models
00:04:11.080 | and the open-source software to run them, like Dynamo,
00:04:13.820 | getting better very quickly.
00:04:15.040 | So it finally makes sense to self-host.
00:04:16.740 | I'm not going to belabor this point
00:04:18.080 | because I'm giving another talk at 12:45,
00:04:20.280 | presenting some benchmarking results that we did
00:04:23.120 | on running vLLM, SGLang, and TensorRT-LLM
00:04:26.800 | on 10 or 12 different models
00:04:29.320 | and 10 or 12 different workloads
00:04:31.680 | to show what's economical and what's not.
00:04:34.840 | OK, so that's the why.
00:04:38.600 | Sort of a slight change or adjustment
00:04:40.480 | in what I think AI engineers should focus on
00:04:42.960 | and know about.
00:04:44.340 | So now, what is it that you need to know,
00:04:47.140 | as an engineer, about this hardware in detail?
00:04:51.400 | So the primary thing is that GPUs embrace high bandwidth,
00:04:56.520 | not low latency.
00:04:57.860 | That's the like key feature of this hardware.
00:05:00.940 | Similar with TPUs, but distinguishes it
00:05:02.820 | from pretty much every other piece of hardware
00:05:04.660 | that you're used to programming.
00:05:07.200 | And then in detail, they optimize for math bandwidth
00:05:11.460 | over memory bandwidth.
00:05:13.120 | So they do like computing on things.
00:05:15.160 | That's where they have the highest throughput.
00:05:18.000 | So you want to align yourself not to latency, but to throughput.
00:05:20.660 | And within throughput, you want to focus
00:05:22.700 | on computational operations.
00:05:24.920 | And then within computational operations,
00:05:26.920 | what you want to focus on if you want to like actually use
00:05:30.200 | the whole GPU you paid for, it's low-precision matrix-matrix multiplications.
00:05:36.420 | Sorry, that wasn't a stutter.
00:05:37.420 | That was matrix-matrix multiplication, not just matrix-vector.
00:05:42.800 | OK, so the first point about latency versus bandwidth.
00:05:46.980 | So I regret to inform you that the scaling of latency
00:05:50.600 | and the reduction of latency in computing systems
00:05:52.740 | died during the Bush administration.
00:05:54.480 | It's not coming back.
00:05:56.460 | See a talk later today for an alternative perspective.
00:05:59.100 | But yeah, GPUs embrace bandwidth scaling.
00:06:04.100 | So a little more detail on that.
00:06:06.780 | So this is a computer or a piece of a computer in case
00:06:09.680 | you haven't looked inside one in a while.
00:06:11.840 | So this is a logic gate from the Zuse Z1 computer built
00:06:15.880 | in Germany in the '30s, kind of the first digital computer.
00:06:18.960 | Digital, but not electronic.
00:06:20.300 | It's mechanical.
00:06:21.400 | So there are all these actuator plates in it
00:06:23.460 | that implemented logical operations.
00:06:26.060 | So what you see there on the left is a logical operation.
00:06:29.700 | If two plates are pushed down, then if both of them
00:06:33.120 | are present, then when the other plate pushes forward,
00:06:36.420 | it will push the final plate forward.
00:06:38.160 | That's the logical operation AND.
00:06:40.240 | And the thing that pushes forward
00:06:44.220 | is driven by a clock, like a literal clock.
00:06:47.500 | I guess now everybody has Apple Watches,
00:06:49.040 | but there was a time when you would have a physical clock
00:06:51.780 | for that sort of thing.
00:06:53.160 | So the clock drives these systems
00:07:00.120 | and causes them to compute their logical operations.
00:07:02.140 | So every time the clock ticks, you get a new operation.
00:07:07.440 | So we've changed computers a little bit in that we use
00:07:10.980 | different physics to drive them, but it's still
00:07:13.140 | the same basic abstract system.
00:07:15.140 | There's a sort of a motive force that happens on a clock
00:07:19.480 | cycle that leads to calculations.
00:07:22.660 | And the cool thing about that is that if you just make that faster,
00:07:25.780 | literally nobody has to think about anything
00:07:29.000 | and the computer gets better.
00:07:30.020 | So this was the primary driver of computers getting better
00:07:33.400 | in the '90s.
00:07:34.620 | No recompiling, no rewriting your software.
00:07:38.020 | Everything just got better.
00:07:39.080 | Because now the clock started going twice as fast, right?
00:07:44.180 | And time is very virtual in computers.
00:07:46.540 | And so the program couldn't possibly know the difference.
00:07:49.820 | So that was really great during that mid to late '90s period.
00:07:53.620 | And then that fell off a cliff in the early 2000s.
00:07:56.780 | And this has impacted a lot of computing over the last two
00:08:00.460 | decades.
00:08:00.960 | But actually, its effects are still being felt.
00:08:03.660 | Like this whole switch away from being able to avoid
00:08:07.780 | needing to think about performance.
00:08:11.020 | So this is slowly and inevitably changing pretty much
00:08:13.820 | everything in software, all kinds of things
00:08:16.660 | you've seen around concurrency: GIL-free Python,
00:08:19.580 | multiprocessing, async coroutines.
00:08:22.940 | So there are a couple of detailed things to dive into here.
00:08:26.960 | I want to make sure that I give enough time
00:08:28.840 | to talk about the GPU stuff.
00:08:30.620 | But there's kind of two notions of how to make things faster
00:08:32.960 | without doing that.
00:08:33.940 | One is parallel.
00:08:35.000 | So like when you have a clock cycle,
00:08:37.100 | just do two things instead of one.
00:08:38.480 | Sounds like a good idea.
00:08:40.940 | The other one is concurrent, which is a little bit trickier.
00:08:43.660 | But it's like-- so you start doing something.
00:08:45.440 | Clock cycle hits.
00:08:46.260 | You start running a calculation.
00:08:47.680 | Maybe that calculation takes five clock cycles to finish.
00:08:50.600 | Like instead of waiting for those clock cycles to finish,
00:08:53.400 | try and do five other things with the next couple of clock
00:08:55.700 | cycles.
00:08:56.440 | It makes your programs really ugly because you
00:08:58.140 | have to write async await everywhere.
00:09:00.360 | And yeah, if you're writing Rust, it's a world of pin.
00:09:05.060 | But yeah, but it helps you keep these super high bandwidth
00:09:09.720 | pipelines busy.
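To make that concurrency idea concrete, here's a minimal Python sketch, assuming nothing beyond the standard library: a hundred overlapping 100-millisecond waits finish in roughly the time of one, instead of running back to back.

```python
import asyncio
import time

async def slow_op(i: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a 100 ms stall on memory or network
    return i

async def main() -> None:
    start = time.perf_counter()
    # Start all 100 operations, then wait; don't await each one in turn.
    results = await asyncio.gather(*(slow_op(i) for i in range(100)))
    print(len(results), f"done in {time.perf_counter() - start:.2f}s")  # ~0.1s, not ~10s

asyncio.run(main())
```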
00:09:12.180 | And so these concurrent and parallel--
00:09:14.760 | these are two strategies to maximize bandwidth that
00:09:17.840 | are adopted at the hardware level all the way up
00:09:20.760 | to the programming level with GPUs to take this bandwidth
00:09:24.380 | further than CPUs can.
00:09:26.440 | So GPUs take parallelism further than CPUs.
00:09:29.940 | So I'm comparing an AMD EPYC CPU and an NVIDIA H100 SXM GPU here.
00:09:36.820 | The figure of merit here is the number of parallel threads that
00:09:39.980 | can operate and the wattage at which they operate.
00:09:42.160 | So an AMD EPYC CPU can do two threads per core at about one watt
00:09:50.280 | per thread.
00:09:51.920 | That's not bad.
00:09:52.620 | But an H100 can do over 16,000 parallel threads at 5 centiwatts
00:09:57.620 | per thread, which is pretty amazing, very big difference.
00:10:02.080 | And parallel means like literally every clock cycle,
00:10:05.460 | all 16,000 threads of execution make progress at the exact same time.
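As a rough sanity check on those watts-per-thread figures, here's the back-of-envelope arithmetic; the power and thread counts are my assumptions (typical board figures), not numbers from the talk.

```python
# Assumed figures for a high-core-count EPYC and an H100 SXM.
epyc_tdp_watts = 280      # assumed CPU package power
epyc_threads = 2 * 128    # 2 threads per core x 128 cores
h100_tdp_watts = 700      # assumed H100 SXM board power
h100_threads = 16_896     # 132 SMs x 128 lanes progressing each cycle

print(epyc_tdp_watts / epyc_threads)  # ~1.1 W per thread
print(h100_tdp_watts / h100_threads)  # ~0.04 W, i.e. ~4-5 centiwatts per thread
```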
00:10:10.720 | So what about concurrency?
00:10:11.960 | So it may look like CPUs have an advantage here
00:10:13.880 | because effectively concurrent threads are unbounded.
00:10:16.300 | Like you can just make a thread in Linux.
00:10:17.960 | Like it's free.
00:10:18.720 | The government doesn't want you to know this.
00:10:21.680 | But there's a limit on H100.
00:10:23.220 | So it looks like, oh, wow, only 250,000 threads?
00:10:26.240 | What am I supposed to do with that?
00:10:28.000 | But the difference here is context switching speed.
00:10:30.340 | How quickly can you go from executing one thing to another?
00:10:33.400 | So if our purpose was to take advantage of every clock cycle
00:10:37.720 | and it takes us 1,000 clock cycles, like a microsecond,
00:10:40.960 | to context switch, then our concurrency
00:10:43.420 | is like actually pretty tightly bounded
00:10:46.300 | because we can't do anything for a whole 1,000 clock cycles.
00:10:50.000 | But in GPUs, context switching happens literally every clock cycle.
00:10:53.560 | It's down there at the warp scheduler inside the hardware.
00:10:57.460 | If you have to think about it that hard,
00:10:59.100 | you're probably having a bad time.
00:11:01.280 | But normally, it's just making everything run faster.
00:11:04.900 | So there's not really a name for the phenomenon
00:11:07.520 | that's driving all of this work.
00:11:09.440 | But David Patterson, who came up with RISC machines
00:11:11.900 | and worked on TPUs, wrote it down.
00:11:14.780 | So I call it Patterson's law, latency lags bandwidth.
00:11:18.740 | So why are we doing all these things, rewriting our programs, rethinking them,
00:11:23.780 | in order to take advantage of increasing bandwidth?
00:11:27.240 | And bandwidth is replacing latency scaling.
00:11:30.240 | It's because if you look across a variety of different subsystems of computers,
00:11:34.120 | networks, memory, disks, the bandwidth improvement
00:11:39.460 | is actually the square of the latency improvement over time.
00:11:45.460 | This is one of those Moore's Law style charts where you're looking at trends in performance
00:11:50.460 | over time.
00:11:51.460 | And it's like for every 10x that we improve latency, we get 100x improvement in bandwidth.
00:11:56.460 | And there's some arguments in the article about where this comes from.
00:12:01.800 | Basically, with latency, you run into the laws of physics.
00:12:03.800 | With bandwidth, you just run into how many things can you do at the same time.
00:12:07.800 | And you can always take the same physics and spread it out more easily than you can come
00:12:12.300 | up with new physics to take advantage of.
00:12:15.140 | You cannot change the laws of physics, as Scotty said in Star Trek.
00:12:19.220 | And that's one of the limits on network latency.
00:12:23.020 | We send packets at 70% of the speed of light, so we can't get them 10x faster.
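That physics floor is easy to work out; the coast-to-coast distance below is an assumption for illustration.

```python
# Round-trip time for a packet at ~70% of the speed of light in fiber.
c = 299_792_458          # speed of light in vacuum, m/s
distance_m = 4_000_000   # ~4,000 km one way, roughly US coast to coast
one_way_s = distance_m / (0.7 * c)
print(f"{2 * one_way_s * 1e3:.0f} ms round trip, before any switching delay")  # ~38 ms
```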
00:12:28.140 | Yeah.
00:12:29.140 | Yeah.
00:12:30.140 | All right.
00:12:31.140 | So that's bandwidth.
00:12:32.140 | GPUs embrace bandwidth.
00:12:34.140 | Maybe the big takeaway from Patterson's Law is that bandwidth has won out over and over again,
00:12:39.320 | so maybe bet on the bandwidth hardware.
00:12:41.480 | I don't know if the person who's going to be talking about LPUs or Etched is here, but we
00:12:46.660 | should fight about this later.
00:12:48.640 | Yeah.
00:12:49.640 | So all right.
00:12:50.640 | What kind of bandwidth, though?
00:12:53.300 | Arithmetic bandwidth over memory bandwidth.
00:12:55.360 | So not moving bytes around, they have high bandwidth memory, the fanciest, finest high bandwidth
00:13:02.600 | memory.
00:13:04.420 | But the thing where they really excel is doing calculations on that memory.
00:13:10.740 | And so the takeaway here is that N squared algorithms are usually bad.
00:13:17.040 | But if it's N squared operations for N memory loads, it actually works out pretty nicely.
00:13:21.820 | It's almost like maybe Bill Dally and others were thinking of this when they built the chip.
00:13:25.340 | I don't know.
00:13:27.760 | So arithmetic intensity is the term for this, or math intensity.
00:13:32.400 | And if you look here at the things highlighted in purple, the compute numbers are denominated
00:13:40.860 | in tera, going up into the thousands, while the memory bandwidth at the bottom is far smaller.
00:13:51.080 | And that has not changed with Blackwell.
00:13:52.500 | It's only gotten worse.
00:13:53.500 | Or better.
00:13:54.500 | I don't know.
00:13:56.500 | The ratio has gone up.
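To pin down that N-squared-operations-for-N-loads point: here's the arithmetic intensity of a square matmul, a rough sketch assuming the ideal case where each matrix moves through memory exactly once.

```python
def matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n**3                         # one multiply-add per (i, j, k) triple
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A, read B, write C, each once
    return flops / bytes_moved

for n in (128, 1024, 8192):
    print(n, f"{matmul_intensity(n):.0f} FLOPs/byte")  # intensity grows linearly in n
```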
00:14:03.200 | So LLM inference works pretty nicely during prompt processing.
00:14:07.140 | With an 8 billion parameter model at FP8 quantization, you're going to move
00:14:10.560 | 8 gigabytes from the memory into the registers for calculation.
00:14:15.460 | You're going to do about 60 billion floating point operations.
00:14:21.620 | So that doesn't really scale too much with the sequence directly.
00:14:30.180 | But then for each token you decode, you need to move those 8 billion parameters again,
00:14:33.960 | from the GPU's memory into the place where the compute happens.
00:14:37.240 | It's a von Neumann architecture:
00:14:44.020 | you can't keep things in place and compute on the data where it sits.
00:14:44.020 | So LLM inference works great during prompt processing, not so much during decoding.
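Here's that prefill-versus-decode asymmetry as back-of-envelope arithmetic; the two-FLOPs-per-parameter-per-token figure is the usual rough rule of thumb, my assumption rather than a number from the talk.

```python
params = 8e9                   # 8B-parameter model
weight_bytes = params * 1      # FP8: one byte per weight, ~8 GB per pass
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token (rule of thumb)

# Decoding one token at a time: ~2 FLOPs per byte moved. Memory-bound.
print(flops_per_token / weight_bytes)

# Prefill over a 1,000-token prompt: same weights moved, ~1,000x the math.
print(1000 * flops_per_token / weight_bytes)
```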
00:14:48.740 | So one way to get around this is to just do more stuff when you're decoding.
00:14:52.640 | So one example is to take a small model-- so 8 billion parameters-- and run it, like,
00:14:57.620 | 1,000 times on the same prompt.
00:14:59.240 | Now you're loading the weights, and you only load the weights one time, but then you generate,
00:15:03.460 | like, 10,000 things.
00:15:05.720 | And so there's, like, kind of an inherent advantage there to small models for being more sympathetic
00:15:10.260 | to the hardware.
00:15:11.800 | You can actually match quality if you do things right, if you have a good verifier --
00:15:15.680 | in this case, does it pass a Python test?
00:15:18.980 | And so that allows you to pick one of your 10,000 outcomes.
00:15:23.220 | And so you can use Llama 3.1 8B to match GPT-4o with 100 generations.
00:15:29.400 | So the figure on the left is a reproduction.
00:15:32.600 | I read a research paper.
00:15:33.620 | I sat down, spent a day coding, and I got the exact same result on different data with a different
00:15:38.220 | model.
00:15:42.900 | So this is legit.
00:15:44.700 | This is real science, you know?
00:15:48.240 | So it's a real phenomenon, and it fits with the hardware.
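Here's roughly what that sample-many-and-verify loop looks like; `generate` and `passes_tests` are hypothetical stand-ins for your inference client and test harness, not APIs from the talk.

```python
from typing import Callable, Optional

def best_of_n(prompt: str,
              generate: Callable[[str, int], list[str]],  # hypothetical batched client
              passes_tests: Callable[[str], bool],        # hypothetical verifier
              n: int = 100) -> Optional[str]:
    # One batched call: the weights are loaded once and amortized over n samples.
    candidates = generate(prompt, n)
    for candidate in candidates:
        if passes_tests(candidate):  # e.g. run the generated code against a test
            return candidate
    return None  # nothing verified; retry, raise n, or fall back to a bigger model
```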
00:15:53.100 | So lastly, we want to do throughput-oriented,
00:16:00.240 | large-scale activities.
00:16:01.620 | We want to do them with computation and mathematics,
00:16:05.240 | not with memory movement.
00:16:06.780 | And the specific thing we want to do is low-precision matrix multiplication.
00:16:11.940 | And the takeaway here is that some surprising things are going to turn out
00:16:14.660 | to be approximately free.
00:16:15.960 | And I don't have time to go into the details on this, but it turns out Kyle Cranin, also
00:16:20.480 | on the Dynamo team, had-- like, came to the exact same conclusion.
00:16:23.820 | We were talking-- you know, comparing notes last night.
00:16:26.720 | So check his talk in the afternoon of the infrastructure track if you want less hand-waving and more charts.
00:16:31.820 | So things like multi-token prediction and multi-sample querying, all this stuff suddenly becomes basically free.
00:16:38.680 | And the reason why is that the latest GPUs, NVIDIA's and others, have this giant chunk in them, the tensor core, that's specialized for low-precision matrix-matrix multiplication.
00:16:50.340 | And so that's, you know, all these things in purple here that have the really big numbers are tensor core output, not the CUDA core output.
00:16:58.060 | And tensor cores do exactly one thing, and it's floating point matrix multiplication.
00:17:03.860 | Bit of a tough world to live in, if you're a programmer, to discover that there's only one data type you're allowed to work with.
00:17:11.920 | But you just get more creative, right? You can do a Fourier transform with this thing if you want.
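That Fourier transform remark checks out on paper: the discrete Fourier transform is literally a matrix multiplication. A minimal NumPy sketch of the algebra (nothing tensor-core-specific here):

```python
import numpy as np

n = 64
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n)  # the n x n DFT matrix

x = np.random.randn(n)
assert np.allclose(F @ x, np.fft.fft(x))  # multiplying by F is exactly the DFT
```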
00:17:15.920 | Yeah, so the generation phase of language models is very heavy on matrix-vector operations, if you just write it out naively.
00:17:23.480 | So the things that are basically free are things that can upgrade you to a matrix-matrix operation.
00:17:28.640 | There are some microbenchmarks from the ThunderKittens people, I think, at Hazy Research, that basically benchmarked a tensor core.
00:17:35.840 | And it looks like, if you give it a matrix and then a mostly empty matrix with one column full, you get one over N of the performance, right?
00:17:45.400 | And so if you just add more stuff there, all of a sudden the performance is scaling to match.
00:17:53.920 | So this is sort of, yeah, this is another phenomenon that pushes you in the direction of generating multiple samples, generating multiple tokens,
00:18:02.400 | as DeepSeek does with multi-token prediction, and I think the Llama 4 models do as well.
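You can see that batch-scaling effect for yourself; a rough microbenchmark sketch, assuming PyTorch and a CUDA GPU are available. The matrix sizes and iteration counts are arbitrary choices for illustration.

```python
import torch

W = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")  # the "weights"

for batch in (1, 16, 256):
    x = torch.randn(8192, batch, dtype=torch.float16, device="cuda")
    for _ in range(10):  # warm up before timing
        W @ x
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(50):
        W @ x  # matrix-vector at batch=1, matrix-matrix beyond
    end.record()
    torch.cuda.synchronize()
    # Time per matmul barely grows with batch: the weight traffic dominates.
    print(batch, f"{start.elapsed_time(end) / 50:.3f} ms")
```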
00:18:08.040 | So, yeah, as an AI engineer, the things you should be looking at are: okay, maybe I can get away with running a smaller model that fits on a GPU that's under my desk.
00:18:19.560 | And then I just scale it out in order to get sufficient quality to satisfy users.
00:18:26.240 | So there's a bunch of research on this stuff back in, like, around the release of ChatGPT when there was still, like, a thriving academic field on top of language models.
00:18:34.200 | People have kind of forgotten about it a bit, but I think the open models are good enough that this is back to being a good idea.
00:18:42.880 | Cool. I think I only have about 10 seconds left, so I'll just say, if you want to learn more, I wrote this GPU glossary, modal.com/gpu-glossary.
00:18:52.520 | It's a "CUDA docs for humans" attempt to explain this whole software and hardware stack in one place with lots of links, so that when you're reading about a warp scheduler and you've forgotten what a streaming multiprocessor architecture is and how that's related to the NVIDIA CUDA compiler driver, it's one click away to get all of those things.
00:19:10.520 | So if you want to run these models on this hardware, there's no better place than the platform that I work on, Modal: serverless GPUs and more. We sort of ripped out and rewrote the whole container stack to make serverless Python for data-intensive and compute-intensive workloads, like language model inference, work well.
00:19:37.160 | And so you should definitely check it out. Come find us at the expo hall and we'll, you know, talk your ear off about it.
00:19:43.800 | All right. Thank you very much.
00:19:45.800 | Thank you.