What every AI engineer needs to know about GPUs — Charles Frye, Modal

The title was what every AI engineer needs to know about GPUs.
People who build AI applications, people who are AI engineers, use the OpenAI API, the Anthropic API, the DeepSeek API, and they build an application on top. And that goes back to the initial diagram that swyx put out, the rise-of-the-AI-engineer thing. (And yeah, probably just mirroring the screen would be great.) Having that API boundary is pretty important, right? You can't really build a complex system if everybody has to know how every piece works and has to know all of it in detail. You'll collapse under the complexity if you do that. (Oh, that wasn't me, but I'm down with it.)
It started off by trying to answer the question of why every AI engineer needs to know about GPUs. AI engineers work at a layer where they're constrained by the needs of users, rather than by what's possible with research or what infrastructure is capable of providing.
And the way that I think about this distinction is by analogy to databases. Very few developers need to actually write a database, except in their undergrad classes. And even very few developers run a database; a lot of them will use either a fully managed service or a hosted service like RDS on Amazon. But they still need to know something about how databases work, despite the fact that they aren't database engineers, even just to press the buttons on the side of those services.
There's a standard databases class, and it's basically about how to write SQL queries. The whole point of it is that there is a thing called an index, and there are a couple of data structures that support it. The intent isn't that you can then go invert a binary tree on a whiteboard. The point is to teach you what you need to know so that you can write queries properly, queries that use the index rather than accidentally skipping it: primary and secondary indices, all these things.
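To make that concrete, here is a small hedged sketch using Python's built-in sqlite3 module (the table and index names are made up for illustration, not something from the talk): the first query can use the index, while the second defeats it.

```python
# Toy illustration of "write queries that use the index and don't
# accidentally skip it", using Python's built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# This predicate matches the indexed column directly, so the planner
# reports a SEARCH using idx_users_email.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall())

# Wrapping the indexed column in a function defeats the index and
# falls back to a full table SCAN.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE lower(email) = ?",
    ("a@example.com",),
).fetchall())
```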
That's a little bit of an easier prospect. But I think we're reaching this point now with language models, where you'll have more ability to integrate tightly with the model, and not use it just like an index on a database.
So yeah, I kind of made this point earlier about open models and the open source software to run them, like Dynamo, and about presenting some benchmarking results that we did to show what's economical and what's not. That shapes what I think AI engineers should focus on. So now, what is it that you need to know about GPUs as an AI engineer?
So the primary thing is that GPUs embrace high bandwidth over low latency. That's the key feature of this hardware, and it distinguishes it from pretty much every other piece of hardware. And then in detail, they optimize for math bandwidth over memory bandwidth; that's where they have the highest throughput. So you want to align yourself not to latency, but to throughput. And what you want to focus on, if you want to actually use the whole GPU you paid for, is low-precision matrix multiplications. That's matrix-matrix multiplications, not just matrix-vector.
OK, so the first point, latency versus bandwidth. I regret to inform you that the reduction of latency in computing systems has essentially stalled. (See a talk later today for an alternative perspective.)
So this is a computer, or a piece of a computer. This is a logic gate from the Z1 computer that Konrad Zuse built in Germany in the 1930s, kind of the first digital computer. What you see there on the left is the logical operation: if two plates are pushed down, then if both of them are present, the other plate pushing forward completes the operation. And there was a time when you would have a physical clock that drives the plates and causes them to compute their logical operations. So every time the clock ticks, you get a new operation.
So we've changed computers a little bit, in that we use different physics to drive them, but it's still the same idea: there's a sort of motive force that happens on a clock. And the cool thing about that is that if you just make the clock faster, everything gets faster. That was the primary driver of computers getting better for a long time, because the clock just started going twice as fast, and the program couldn't possibly know the difference. So that was really great during the mid-to-late '90s. And then it fell off a cliff in the early 2000s.
And this has impacted a lot of computing over the last two decades, but actually its effects are still being felt. This switch away from being able to avoid thinking about parallelism is slowly and inevitably changing pretty much everything in software: all the things you've seen around concurrency, GIL-free Python, and so on.
So there are a couple of detailed things to dive into here. There are kind of two notions of how to do more things at once. One is parallel: multiple things literally make progress at the same time. The other one is concurrent, which is a little bit trickier. You start doing something, and maybe that calculation takes five clock cycles to finish. Instead of waiting for those clock cycles to finish, you try to do five other things with the next couple of clock cycles. It makes your programs really ugly, because you have to restructure them around the waiting. And yeah, if you're writing Rust, it's a world of Pin.
But it helps you keep these super high bandwidth compute units fed. So these are two strategies to maximize bandwidth that are adopted all the way from the hardware level up to the programming level with GPUs, to take advantage of this bandwidth.
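Here's a minimal sketch of the concurrency half of that idea in Python with asyncio; the slow_fetch coroutine is a made-up stand-in for any operation that spends most of its time waiting.

```python
# Concurrency as latency hiding: instead of waiting for one slow operation
# to finish before starting the next, start them all and overlap the waits.
# slow_fetch stands in for any I/O- or latency-bound operation.
import asyncio
import time

async def slow_fetch(i: int) -> int:
    await asyncio.sleep(0.5)   # pretend we're waiting on the network or memory
    return i * i

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(slow_fetch(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    # Roughly 0.5 s total instead of ~5 s serial: the latency of each item is
    # unchanged, but the throughput is about 10x.
    print(results, f"{elapsed:.2f}s")

asyncio.run(main())
```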
So I'm comparing an AMD EPYC CPU and an NVIDIA H100 SXM GPU here. The figure of merit is the number of parallel threads that can operate and the wattage at which they operate. An AMD EPYC CPU can do two threads per core at about one watt per thread. But an H100 can do over 16,000 parallel threads at about five centiwatts per thread, which is a very big difference. And parallel means that literally every clock cycle, all 16,000 threads of execution make progress at the exact same time.
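As a rough back-of-envelope on those per-thread wattage figures (the TDPs and thread counts below are my ballpark numbers for an H100 SXM and a high-core-count EPYC, not numbers from the slide):

```python
# Ballpark watts per parallel thread for a CPU vs. a GPU.
# TDPs and thread counts are approximate, for illustration only.
epyc_tdp_watts, epyc_threads = 360, 2 * 96       # ~96 cores x 2 SMT threads
h100_tdp_watts, h100_threads = 700, 132 * 128    # ~132 SMs x 128 lanes

print(f"EPYC: ~{epyc_tdp_watts / epyc_threads:.2f} W per thread")    # ~1.9 W
print(f"H100: ~{h100_tdp_watts / h100_threads:.3f} W per thread")    # ~0.04 W
```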
So it may look like CPUs have an advantage here, because effectively concurrent threads are unbounded on a CPU. The government doesn't want you to know this. Whereas on a GPU it looks like, oh wow, only about 250,000 concurrent threads? But the difference here is context switching speed: how quickly can you go from executing one thing to executing another? If our purpose was to take advantage of every clock cycle, and a context switch takes us 1,000 clock cycles, about a microsecond, then we've lost, because we can't do anything useful for a whole 1,000 clock cycles. But in GPUs, context switching happens literally every clock cycle. It's done down in the warp scheduler, inside the hardware.
But normally, it's just making everything run faster. So there's not really a name for this phenomenon. But David Patterson, who came up with RISC machines, wrote an article about it, so I call it Patterson's law: latency lags bandwidth. Why are we doing all these things, rewriting our programs and rethinking them, in order to take advantage of increasing bandwidth? It's because, if you look across a variety of different subsystems of computers (networks, memory, disks), the bandwidth improvement is actually the square of the latency improvement over time. This is one of those Moore's-Law-style charts where you're looking at trends in performance over time, and for every 10x that we improve latency, we get a 100x improvement in bandwidth.
And there are some arguments in the article about where this comes from. Basically, with latency, you run into the laws of physics. With bandwidth, you just run into how many things you can do at the same time. And you can always take the same physics and spread it out more easily than you can come up with new physics. You cannot bribe the laws of physics, as Scotty says in Star Trek. And that's one of the limits on network latency: we send packets at about 70% of the speed of light, so we can't make them 10x faster.
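As a hedged illustration of that physics limit (the route length below is a made-up example, not a number from the talk):

```python
# Lower bound on network latency from the speed of light alone.
# Signals in fiber travel at very roughly 0.7c; the route length is made up.
c = 3.0e8                      # speed of light, m/s
route_m = 4_000_000            # e.g. a ~4,000 km coast-to-coast fiber path
one_way_s = route_m / (0.7 * c)
print(f"one-way: ~{one_way_s * 1e3:.1f} ms, round trip: ~{2 * one_way_s * 1e3:.1f} ms")
# Roughly 19 ms one way before any switching or queuing. No amount of
# parallelism changes this number, but more bandwidth is always available.
```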
Maybe the big takeaway from Patterson's law is that bandwidth has won out over and over again. I don't know if the person who's going to be talking about LPUs or Etched is here, but we can compare notes later. So GPUs aren't really about moving bytes around. They have high bandwidth memory, the fanciest, finest high bandwidth memory available, but the thing where they really excel is doing calculations on that memory.
And so the takeaway here is that N-squared algorithms are usually bad. But if it's N-squared math operations for N memory loads, it actually works out pretty nicely on this hardware. It's almost like Bill Dally and others were thinking of this when they built the chip. Arithmetic intensity is the term for this, or math intensity.
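A quick sketch of what that ratio looks like for the two operations that matter here, matrix-vector versus matrix-matrix multiplication (square matrices of side n and 2-byte elements assumed, purely for illustration):

```python
# Rough arithmetic intensity (FLOPs per byte moved) of matrix-vector vs.
# matrix-matrix multiplication, assuming n x n operands in a 2-byte dtype.

def matvec_intensity(n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n * n                                  # n^2 multiply-adds
    bytes_moved = (n * n + 2 * n) * bytes_per_elem     # matrix + in/out vectors
    return flops / bytes_moved

def matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n ** 3                                 # n^3 multiply-adds
    bytes_moved = 3 * n * n * bytes_per_elem           # two inputs + one output
    return flops / bytes_moved

for n in (1024, 4096, 8192):
    print(n, round(matvec_intensity(n), 2), round(matmul_intensity(n), 1))
# Matrix-vector stays near ~1 FLOP/byte no matter how big n gets, while
# matrix-matrix intensity grows like n/3, which is what keeps the math units busy.
```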
And if you look here at the things highlighted in purple on the spec sheet, the math throughput is denominated in tera-operations, with numbers that go up into the thousands, while the memory bandwidth at the bottom is only a few terabytes per second.
So LLM inference works pretty nicely during prompt processing. For an 8 billion parameter model with FP8 quantization, you're going to move 8 gigabytes of weights from memory into the registers for calculation, and you're going to do about 60 billion floating point operations. So the weight movement doesn't really scale with the sequence length directly. But when you then need to decode, you need to move those 8 billion parameters again for every token you generate, from the GPU's memory into the place where the compute happens. You can't avoid that; it's a von Neumann architecture, you can't compute on data in place. So LLM inference works great during prompt processing, not so much during decoding.
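Here's a back-of-envelope sketch of why batch-1 decoding is memory-bound; the roughly 2 FLOPs per parameter per token rule of thumb and the H100 figures are my assumptions for illustration, not numbers from the talk.

```python
# Arithmetic intensity of single-stream decoding for an 8B model in FP8,
# compared with what an H100-class GPU can sustain. Assumes ~2 FLOPs per
# parameter per decoded token; all figures are rough and for illustration.
params = 8e9
weight_bytes = params * 1                   # FP8: ~8 GB moved per decoded token
flops_per_token = 2 * params                # ~16 GFLOPs of math per token

decode_intensity = flops_per_token / weight_bytes
print(f"decode: ~{decode_intensity:.0f} FLOPs per byte")

# H100 SXM: roughly 3.35e12 bytes/s of HBM bandwidth and on the order of
# 2e15 FP8 FLOP/s of tensor-core math.
balance_point = 2e15 / 3.35e12
print(f"GPU balance point: ~{balance_point:.0f} FLOPs per byte")
# ~2 FLOPs/byte of work vs. a machine that wants hundreds of FLOPs/byte:
# batch-1 decode leaves the math units mostly idle.
```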
So one way to get around this is to just do more stuff while you're decoding. One example is to take a small model, say 8 billion parameters, and run it many times in parallel. Now you only load the weights one time, but you generate many samples. So there's kind of an inherent advantage there for small models: they're more sympathetic to the hardware. And you can actually match quality if you do things right, if you have a good verifier; in this case, the verifier is: does it pass a Python test? That lets you pick the best of your 10,000 outcomes. And so you can use Llama 3.1 8B to match GPT-4o with about 100 generations.
The figure on the left is a reproduction: I sat down, spent a day coding, and got the same result on different data in a different setting. This is real science, you know? So it's a real phenomenon, and it fits with the hardware.
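A minimal sketch of that best-of-n-with-a-verifier pattern is below; the generate_candidates function is a placeholder for whatever inference stack you use, and the test-running details are my own simplification rather than the setup behind the figure.

```python
# Best-of-n sampling with a pass/fail verifier ("does it pass a Python test?").
# generate_candidates is a placeholder for your inference call; batching all n
# samples into one call is what keeps the GPU's matrix units busy.
import subprocess
import tempfile

def passes_test(candidate_code: str, test_code: str) -> bool:
    """Run the candidate plus its test in a subprocess; pass means exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def best_of_n(prompt: str, test_code: str, generate_candidates, n: int = 100):
    """Generate n candidates in one batched call, return the first that passes."""
    candidates = generate_candidates(prompt, n=n)   # one weight load, many tokens
    for candidate in candidates:
        if passes_test(candidate, test_code):
            return candidate
    return None   # nothing passed; the caller can escalate to a bigger model
```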
So lastly: we want to do throughput-oriented work, we want to do it with computation and mathematics rather than with memory movement, and the specific thing we want to do is low-precision matrix multiplication. The takeaway here is that some surprising things turn out to be basically free. I don't have time to go into the details, but it turns out Kyle Kranen, on the Dynamo team, came to exactly the same conclusion; we were comparing notes last night. So check his talk in the afternoon of the infrastructure track if you want less hand-waving and more charts. Things like multi-token prediction and multi-sample queries suddenly become basically free.
And the reason why is that the latest GPUs, NVIDIA's and others, have this giant chunk of silicon in them, the tensor core, that's specialized for low-precision matrix-matrix multiplication. All the things in purple here with the really big numbers are tensor core throughput, not CUDA core throughput. And tensor cores do exactly one thing: floating point matrix multiplication. It's a bit of a tough world to live in, if you're a theoretically minded programmer, to discover there's only one data type you're allowed to work with. But you just get more creative, right? You can do a Fourier transform with this thing if you want.
Yeah, so the generation phase of language models is very heavy on matrix-vector operations if you just write it out naively. So the things that are basically free are the things that upgrade you to a matrix-matrix operation. There are some microbenchmarks from the ThunderKittens people at Hazy Research, I think, that basically isolate the tensor core. If you give it a full matrix and a mostly empty matrix with only one column filled, you get about one over N of the peak performance. And if you just add more real columns, all of a sudden the performance scales up to match. So this is another phenomenon that pushes you toward generating multiple samples and generating multiple tokens, as DeepSeek does with multi-token prediction, and I think the Llama 4 models do as well.
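Here's a hedged sketch of that matrix-vector versus matrix-matrix effect, written as a simple PyTorch timing loop rather than the ThunderKittens microbenchmark itself; the sizes and dtype are arbitrary choices for illustration, and it needs a CUDA GPU to run.

```python
# Multiplying an (n x n) weight matrix by 1 column vs. by 128 columns takes a
# similar amount of wall-clock time, because batch-1 decode is dominated by
# moving the weights, not by the math. Timings are illustrative only.
import torch

def time_matmul(weights: torch.Tensor, x: torch.Tensor, iters: int = 50) -> float:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        weights @ x
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per multiply

n = 8192
weights = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
for batch in (1, 16, 128):
    x = torch.randn(n, batch, dtype=torch.bfloat16, device="cuda")
    print(f"batch={batch:4d}: {time_matmul(weights, x):.3f} ms")
# Expect batch=1 and batch=128 to land close together: batching gives you
# ~128x more useful output per weight load.
```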
So as an AI engineer, the things you should be looking at are: maybe I can get away with running a smaller model that fits on a GPU under my desk, and then scale it out in order to get sufficient quality to satisfy users. There's a bunch of research on this stuff from around the release of ChatGPT, when there was still a thriving academic field on top of language models. People have kind of forgotten about it a bit, but I think the open models are now good enough that this is back to being a good idea.
Cool. I think I only have about 10 seconds left, so I'll just say: if you want to learn more, I wrote this GPU glossary, modal.com/GPU-glossary. It's a 'CUDA docs for humans' attempt to explain this whole software and hardware stack in one place, with lots of links, so that when you're reading about a warp scheduler and you've forgotten what a streaming multiprocessor architecture is, and how that's related to the NVIDIA CUDA compiler driver, all of those things are one click away.
And if you want to run models on this hardware, there's no better place than the platform I work on, Modal: serverless GPUs and more. We ripped out and rewrote the whole container stack to make serverless Python for data-intensive and compute-intensive workloads, like language model inference, work well. So you should definitely check it out. Come find us in the expo hall and we'll talk your ear off about it.
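For flavor, here is a minimal sketch of what a GPU function on Modal can look like; the app name, image contents, and function body are placeholders of my own, so treat it as a rough shape and check Modal's docs for the current API.

```python
# Sketch of a serverless GPU function on Modal (illustrative only; consult
# Modal's documentation for the up-to-date API and options).
import modal

app = modal.App("llm-inference-sketch")
image = modal.Image.debian_slim().pip_install("torch", "transformers")

@app.function(gpu="H100", image=image)
def generate(prompt: str) -> str:
    # Placeholder body: load a small open model here and decode a batch of samples.
    ...

@app.local_entrypoint()
def main():
    # Runs the function remotely in a GPU container: `modal run this_file.py`
    print(generate.remote("Write a Python function that reverses a string."))
```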