System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis

A couple of different things. People have been talking about stagnation, and I don't know if anyone here sees it that way, but a lot of people have been talking about stagnation of models, and a lot of that just has to do with the fact that we haven't seen a big capabilities leap recently. That's really because the models we're using today are largely the same as the models that were trained in 2022. GPT-4, GPT-4 Turbo, GPT-4o: those are just smaller models trained for longer, so similar quality. Claude 3.5 Sonnet came out recently, but again, that's actually smaller than Opus and somehow better, because they trained it for longer. We haven't seen an extremely large model come out yet, but we will soon. One interesting thing is that GPT-4 is reportedly around 1.8 trillion parameters, and it's crazy expensive to run: a couple hundred billion of those parameters are active for each token, so every token requires almost 600 gigaflops. And yet a year from now that's almost going to be considered a last-generation model. So there are a couple of things I wanted to talk about regarding that, mostly on the inference side, because I don't think anyone here is going to try to train that kind of next-generation model, but we definitely need to be able to run it.
So let me break down inference in detail. There are two parts to inference: prefill and decode. Prefill is the prompt processing, and the interesting thing is that a 2K-context prompt, 2,000 tokens that you input into GPT, is about a petaflop of compute all by itself; a 32,000-token prompt is around 20 petaflops. So it takes an incredible amount of compute just to process the prompt. And while prefill is very compute-intensive, decode is the opposite. Decode is generating each token iteratively: you process the prompt, generate a token, feed it back in, and keep going. Decode is extremely memory-bandwidth-intensive, because you have to load the entire model, all the weights, into the chip or chips for every decode step. The big challenge is that with 1.8 trillion parameters, at any reasonable batch size you're activating all the experts, so you need to load all 1.8 trillion parameters for every single token-generation step, even if you're serving multiple users at once. That means you need terabytes a second of memory bandwidth just for the weights. If you want 30 tokens per second per user, which I think is the minimum bar for most people (a lot of people want hundreds of tokens per second), and you're serving 64 users, you need something like 60 terabytes a second of memory bandwidth. An H100 has about three. So this is an extremely challenging systems problem.
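To put rough numbers on that, here is a quick back-of-the-envelope sketch. The parameter counts are the rumored GPT-4 figures from above, not official numbers, and the rule of roughly 2 FLOPs per active parameter per token plus int8 weights are simplifying assumptions:

```python
# Back-of-the-envelope math for the figures above. Model sizes are the
# rumored GPT-4 numbers (not official); 2 FLOPs per active parameter per
# token and int8 weights are simplifying assumptions.

TOTAL_PARAMS = 1.8e12      # assumed total parameters (MoE)
ACTIVE_PARAMS = 280e9      # assumed parameters active per token
BYTES_PER_PARAM = 1        # int8/fp8 weights; use 2 for fp16

# Decode compute: ~2 FLOPs per active parameter per generated token.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per decoded token")

# Prefill: every prompt token pays roughly the same matmul cost.
for prompt_len in (2_000, 32_000):
    print(f"prefill of {prompt_len} tokens ~ "
          f"{flops_per_token * prompt_len / 1e15:.1f} PFLOPs")

# Decode bandwidth: at a reasonable batch size every expert is hit, so each
# generation step streams the full weight set through the chips. The batch
# (e.g. 64 users) shares that load, but 30 steps per second still means:
weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
print(f"~{weight_bytes * 30 / 1e12:.0f} TB/s of memory bandwidth "
      f"(a single H100 has ~3.35 TB/s)")
```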
Moreover, while decode is very bandwidth-intensive, it's actually quite cheap on compute, which is why if you look at OpenAI's pricing or other cloud pricing, you see a three- or four-to-one ratio between prefill and decode pricing: the input tokens cost a third or a quarter of what the output tokens do. Today the best models, GPT-4o and Claude 3.5 Sonnet, are, I want to say, $5 per million tokens for input and $15 per million tokens for output, so $5 for prefill and $15 for decode. And soon, in the open source, what everyone here can actually touch, we're going to have Llama 3 405B, and that's going to be a real capability unlock for the open-source market as well as for the builders here. But there are a couple of things people really need to be able to implement. You can't just run llama.cpp on Llama 405B; it's just not going to work. There's a bunch of stuff people have to work on, whether that's using NVIDIA's TensorRT-LLM, which only works on NVIDIA hardware, or vLLM, an open-source library that works on AMD and Intel and soon other people's chips as well. There's a lot of stuff people need to figure out.
One of those things is continuous batching. Running inference at batch size one is horrendously expensive. It's fine if you're running on your own personal device, but if you're running in the cloud, renting GPUs, batch size one is going to cost you 10x more, and 10x is a low bar; it could be 10x to 100x more than running at a high batch size. So you have to figure out how to run at high batch sizes, meaning many concurrent users served at once. One of the things that makes that difficult is that user requests come in at different times: one person might send a request now, another person sends one five seconds later, and the first person's request isn't done yet. So you need continuous batching, meaning you run through the model iteratively, step by step, and bring new users into the batch as they arrive. Continuous batching is something you have to have support for, and a lot of software today, like llama.cpp, doesn't support it. So either you build it yourself or you contribute to an open-source project that builds it, to enable low-cost inference for models like Llama 405B.
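Here is a toy sketch of the idea, not vLLM's or TensorRT-LLM's actual implementation; the `engine.prefill` and `engine.decode_step` hooks are hypothetical stand-ins for a real inference backend:

```python
# Toy continuous batching: instead of waiting for a whole batch to finish,
# the scheduler re-forms the batch every decode step, admitting new requests
# and retiring finished ones. The `engine` object is a hypothetical backend.
from collections import deque

class Request:
    def __init__(self, req_id, prompt_tokens, max_new_tokens):
        self.req_id = req_id
        self.tokens = list(prompt_tokens)   # grows by one token per decode step
        self.remaining = max_new_tokens

def serve(engine, waiting: deque, max_batch: int = 64):
    running = []
    while waiting or running:
        # Admit new requests whenever a slot is free; no waiting for the
        # current batch to drain, which is the whole point.
        while waiting and len(running) < max_batch:
            req = waiting.popleft()
            engine.prefill(req)              # build this request's KV cache
            running.append(req)
        engine.decode_step(running)          # one pass appends one token per request
        for req in running:
            req.remaining -= 1
        # Retire requests that hit their token budget or emitted end-of-sequence.
        running = [r for r in running
                   if r.remaining > 0 and r.tokens[-1] != engine.eos_token]
```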
Another one of those is disaggregated prefill, or disaggregated batching, depending on what you want to call it. Going back to earlier: prefill is very compute-intensive and decode is very bandwidth-intensive. These are two different workloads, but when you're serving a user, whether in your own app or through an API, the user doesn't care that it's two different workloads. To them it's one workload: I submit something and I get tokens back. But anyone running the infrastructure themselves needs to be keenly aware that these are two different workloads. So one thing a lot of people have started to do, and Google has publicly said they're doing it, I believe OpenAI and Anthropic are doing it too, and firms like Together and Fireworks have hinted that they're doing it, is disaggregated prefill. Once your inference volumes are high enough, you don't just replicate the model across however many chips you have. Say it takes four chips to serve Llama 405B in the future: if you have enough users, you don't just go four, then eight, then 16, replicating that across the world. Instead you have one set of accelerators do the prefill, which is very compute-intensive, and then hand off to another set of accelerators to do the decode. Today everyone uses the same accelerator type for both, an H100 or an A100, maybe an L40, but mostly H100s.
And there's a big reason you do this: noisy neighbors. If you've ever worked on CPUs or anything in cloud computing, noisy neighbors are a huge issue. It's actually trivial to dramatically slow down most inference providers' services: if you send queries in a certain way, in a sort of malicious way, you can just slow down people's service. That impacts users' time to first token, which is a huge issue, because if time to first token is too long, people will just quit using your service. And if the tokens per second varies a lot, for a moment you're getting 100 tokens per second, then it drops to 30, then it goes back up to 100, that's really annoying to the user. So there are a lot of things around SLAs and reliability that you have to guarantee, and disaggregated prefill is one of the techniques for doing that. For example, say someone has a database and wants to run an LLM query across every single row, and they submit the whole thing to you, their service provider, because you have this cool model fine-tuned on some dataset. If they submit 10,000 rows at once, that's going to kill everyone else's performance. So this is one of the techniques for making sure that the customer you definitely want to serve doesn't wreck everyone else's experience, because once you open your service up to the real world, you can't control who submits what, and rate limits are the most annoying thing ever; that's not the right way to go about it.
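Here is a minimal sketch of the routing idea, assuming two worker pools with a queue between them; it illustrates the concept rather than any provider's actual architecture, and every name in it is made up:

```python
# Toy disaggregated prefill: one pool of workers handles the compute-heavy
# prefill, a separate pool handles the bandwidth-heavy decode. The KV-cache
# handoff here is an in-process queue; real systems move it over the network
# or shared storage. The engine objects are hypothetical backends.
import queue
import threading

prefill_jobs = queue.Queue()   # incoming requests with raw prompts
decode_jobs = queue.Queue()    # (request, kv_cache) pairs ready to decode

def prefill_worker(engine):
    while True:
        request = prefill_jobs.get()
        kv_cache = engine.prefill(request.prompt)   # compute-bound step
        decode_jobs.put((request, kv_cache))        # hand off to the decode pool

def decode_worker(engine):
    while True:
        request, kv_cache = decode_jobs.get()
        for token in engine.generate(kv_cache):     # bandwidth-bound step
            request.stream(token)

def start(prefill_engines, decode_engines):
    # The two pools are sized independently: long-prompt traffic grows the
    # prefill pool, chatty short-prompt traffic grows the decode pool, and a
    # single giant prompt only ties up prefill workers, not decode latency.
    for eng in prefill_engines:
        threading.Thread(target=prefill_worker, args=(eng,), daemon=True).start()
    for eng in decode_engines:
        threading.Thread(target=decode_worker, args=(eng,), daemon=True).start()
```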
Another thing is context caching. Google launched this recently, and they're the only ones offering it today, but I think it's a really big deal. When people talk about fine-tuning models, that's great, but in reality the best models are really expensive or outright impossible to fine-tune. I can't go fine-tune Claude 3.5 Sonnet, and fine-tuning Llama 405B is going to take dozens and dozens of GPUs. The same goes for closed-source models generally, and Google mostly only offers closed models at the top end. So for Gemini 1.5 Pro they recently brought out context caching: instead of fine-tuning your model, why not just fill out the context length, and they offer I think two million tokens of context today, with your data? There are a couple of advantages to that. One is that you can use the best models: with fine-tuning you're really focused on things like Llama 7B, Mixtral, or Llama 70B, which are much lower-quality models than what's available in the closed-source world. So one thing you can do is use what Google calls context caching, and in the open-source world we'll have super-long-context models soon enough. But think about the economics. We talked about $15 per million output tokens and $5 per million input tokens. On the best closed-source models today, if you submit a prompt of, say, a million tokens, and most of the time you're looking at a document and getting a short answer back, your output is tiny, so almost all of the cost is just sending them that document. That's really going to hurt. So for people targeting something like legal AI or contract-review AI, a lot of these enterprise use cases, prefill is going to dominate your cost if you're using APIs.
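As a quick illustration of how lopsided that gets, using the list prices quoted above and a made-up 500-token answer:

```python
# Cost split for one long-document query at the prices quoted above
# ($5 per million input tokens, $15 per million output tokens). The
# 1M-token document and 500-token answer are illustrative numbers.
input_tokens, output_tokens = 1_000_000, 500
input_cost = input_tokens / 1e6 * 5.00      # prefill
output_cost = output_tokens / 1e6 * 15.00   # decode
total = input_cost + output_cost
print(f"prefill ${input_cost:.2f} + decode ${output_cost:.4f} = ${total:.2f}")
print(f"prefill share of the bill: {input_cost / total:.1%}")
```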
So Google has context caching, and open source will have it too, for models you can run yourself and that others will deploy over time. Basically, you don't recompute the KV cache, the processed context, every single time; you cache it instead. The problem is that saving it takes an incredible amount of memory, so you don't keep it in the GPU's memory, you keep it in CPU memory or storage. vLLM, the open-source inference library, is building this right now, so if you're interested in contributing, check that out, and if you're interested in using it, go star the project. Most of the models we have in the open source today are only something like 4K, 8K, or 32K context length, but longer ones are coming, and being able to dramatically reduce your costs by caching the context is going to matter a lot.
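Here is a minimal sketch of the idea, assuming a hypothetical backend with `compute_kv` and `decode` hooks; real implementations, like vLLM's prefix caching or Gemini's context caching, manage memory tiers and eviction much more carefully:

```python
# Toy prefix/context cache: hash the shared document prefix, compute its
# KV cache once, keep it in host (CPU) memory, and reuse it for every query
# against that document. `backend` is a hypothetical inference engine.
import hashlib

kv_store = {}   # prefix hash -> KV cache held in CPU RAM or storage

def cached_query(backend, document_tokens, question_tokens):
    key = hashlib.sha256(str(document_tokens).encode()).hexdigest()
    if key not in kv_store:
        # Pay the expensive prefill for the document exactly once.
        kv_store[key] = backend.compute_kv(document_tokens)
    doc_kv = kv_store[key]
    # Per request, only the short question is prefilled; decoding continues
    # from the cached document KV plus the question's KV.
    return backend.decode(doc_kv, question_tokens)
```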
Now I'm going to talk about more head-in-the-clouds stuff instead of immediately usable things: what's coming down the pipeline. GPT-4 was something like 20,000 chips for 90 to 100 days and used around 38 gigawatt-hours. Very expensive, very cool. But what are they building now? OpenAI, xAI, Anthropic, and many others are building 100,000-chip clusters, and a cluster like that would train GPT-4 in about three days, which makes that scale kind of irrelevant.
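A rough sanity check on those numbers; the peak throughputs are spec-sheet figures, while the ~35% utilization (MFU) and the assumption of FP8 training on H100s are mine:

```python
# Sanity check on the training numbers. Peak throughputs are spec-sheet
# values; ~35% model FLOPs utilization (MFU) and FP8 training on the H100
# cluster are assumptions.
GPT4_FLOPS = 2e25           # rumored GPT-4 training compute

A100_BF16_PEAK = 312e12     # FLOPs/s, dense
H100_FP8_PEAK = 1979e12     # FLOPs/s, dense
MFU = 0.35

def train_days(n_gpus, peak_flops):
    return GPT4_FLOPS / (n_gpus * peak_flops * MFU) / 86_400

print(f"20k A100s:  {train_days(20_000, A100_BF16_PEAK):.0f} days")   # ~100 days
print(f"100k H100s: {train_days(100_000, H100_FP8_PEAK):.1f} days")   # ~3 days
```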
I'll skip over the next part because it's not really too relevant, but the question is what a modern system is capable of. The H100 is pretty fast relative to the A100, and the new NVIDIA chips are coming down the pipeline. So what's coming out of these 100,000-GPU clusters? It's not going to be a 1.8-trillion-parameter model; it could be in the tens of trillions of parameters. On training FLOPs, I mentioned GPT-4 is roughly 2e25 FLOPs, which is a number that doesn't mean much on its own, but with a 100,000-GPU cluster you can do on the order of 1e26 to 1e27 FLOPs, and running a model like that is going to require something like 200 terabytes a second of memory bandwidth.
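Another quick sanity check, with the caveat that the next-generation model size is pure speculation; the 10-trillion-parameter figure, int8 weights, 20 to 30 tokens per second, and the 120-day run length are all assumptions:

```python
# Speculative sizing for a next-generation model; nothing here is a known spec.
PARAMS = 10e12            # assumed "tens of trillions" class model
BYTES_PER_PARAM = 1       # int8 weights

for tok_per_s in (20, 30):
    bandwidth = PARAMS * BYTES_PER_PARAM * tok_per_s / 1e12
    print(f"{tok_per_s} tok/s -> ~{bandwidth:.0f} TB/s of aggregate memory bandwidth")

# Training budget: 100k H100s at FP8, ~35% MFU, over an assumed ~120-day run
# lands in the e26-e27 range mentioned above.
print(f"~{100_000 * 1979e12 * 0.35 * 120 * 86_400:.1e} FLOPs")
```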
But what does that actually look like? The image on the top right is Microsoft's data centers in Arizona, where they're training GPT-5. They have about a hundred thousand GPUs there, and it's around 150 megawatts. The average home doesn't consume anything like that; it's the power consumption of tens of thousands, if not hundreds of thousands, of homes, which is kind of insane. Elon has talked about his next-generation cluster: he's building a hundred-thousand-GPU cluster today, but he's said the next one will be 300,000 GPUs. That's kind of insane too, because the power cost for that alone would be something like $500 million a year. People are kind of crazy, but it's pretty cool.
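Rough arithmetic behind that power bill; the all-in draw per GPU (server, networking, and cooling overhead) and the electricity price are assumptions on my part:

```python
# Back-of-the-envelope power cost for a 300k-GPU cluster. The ~1.5 kW all-in
# draw per GPU (server, networking, cooling overhead) and $0.12/kWh delivered
# electricity price are assumptions, not known figures for any real site.
N_GPUS = 300_000
KW_PER_GPU = 1.5
USD_PER_KWH = 0.12
HOURS_PER_YEAR = 8_760

site_mw = N_GPUS * KW_PER_GPU / 1_000
annual_cost = N_GPUS * KW_PER_GPU * HOURS_PER_YEAR * USD_PER_KWH
print(f"~{site_mw:.0f} MW site, ~${annual_cost / 1e6:.0f}M per year in electricity")
# -> roughly half a billion dollars a year, in line with the figure above.
```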
The interesting thing on the training side is that when you train a model today, people just talk about fully connected clusters: every GPU is connected to every other GPU at some speed, and you do all your collective operations over that. That's not really possible when you go to these super-large clusters. The hundred-thousand-GPU clusters are being built this year, and next year they're planning to build multiple hundred-thousand-GPU clusters. Already a single cluster can span multiple buildings, so there's a lot of complicated networking going on to connect these data centers together. One other thing that's interesting to think about, again head-in-the-clouds, is that when you connect these chips together there's a lot of optics: you convert from electrical to optical, go over fiber between chips, through transceivers, and so on. Those transceivers are extremely unreliable; they tend to have a mean time to failure of around five years. So if you're talking about a hundred-thousand-GPU cluster, or a 500,000-GPU cluster, you're going to have something fail roughly every five minutes. That's insane. How do you even deal with something in your cluster failing every five minutes while you're training a model? That's again more of a hardware-oriented problem.
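A rough sketch of the arithmetic behind the every-five-minutes claim; the five-year figure is the ballpark MTBF mentioned above, and the link counts are illustrative (a 100,000-GPU cluster typically has several optical links per GPU, so the fleet quickly reaches hundreds of thousands of transceivers):

```python
# Fleet-wide failure rate from a per-part MTBF: with n independent
# transceivers, the mean time between failures scales as MTBF / n.
# The 5-year MTBF is a ballpark, and the link counts are illustrative.
MTBF_MINUTES = 5 * 365 * 24 * 60    # ~5-year MTBF per transceiver, in minutes

for n_links in (100_000, 500_000):
    print(f"{n_links:,} transceivers -> one failure roughly every "
          f"{MTBF_MINUTES / n_links:.1f} minutes")
```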
The other thing that's interesting is that when you get chips, they're not all the same speed. An H100 is not an H100; there are stragglers. When you get a large distribution of chips, you run into what the industry calls the silicon lottery: you can buy a gaming GPU, compare it to other people's gaming GPUs on the forums, and there are genuinely percentage-level differences in performance. The problem for a massive training cluster is that training is a synchronous workload: you run through a bunch of data, pass the gradients around, update the weights, and repeat, over and over. Because it's synchronous, if one GPU is 10% slower, then everything is 10% slower. ByteDance had a cool paper where they saw roughly a 25% decrease in speed just because of one random GPU that technically worked, and according to NVIDIA was fine, but was about 25% slower than it should have been, and that was on a 20,000-GPU cluster. These are the problems people run into at scale. They pulled that GPU out, and you can see their performance dramatically uplift during training. Again, that was ByteDance on a 20,000-GPU cluster, so it's a big, big issue.
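A toy illustration of why one slow chip gates the whole job in synchronous training; the straggler position and the 25% slowdown are made-up numbers echoing the anecdote above:

```python
# In synchronous data parallelism the gradient all-reduce waits for every
# worker, so a step can only go as fast as the slowest GPU in the job.
nominal_step = 1.0                      # per-GPU step time, arbitrary units
n_gpus = 20_000

per_gpu = [nominal_step] * n_gpus
per_gpu[1234] = nominal_step * 1.25     # one "silicon lottery" loser, 25% slower

effective_step = max(per_gpu)           # everyone waits for the straggler
print(f"cluster step time: {effective_step:.2f}x nominal")   # -> 1.25x
```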
I think some of the other stuff in this presentation isn't really relevant here, but what these next-generation systems look like is a very important question to ask yourself, along with what you do about it. A lot of the scaffolding people are building for LLMs today deals with hallucinations and things like that, and the hope that a lot of the AGI people have is that when you 100x the compute, when you build a cluster that costs $500 million a year just in electricity, and over $10 billion for the cluster itself, and you train a model with it, it's going to get rid of a lot of those hallucinations and let us do a lot of interesting things. So that's basically all for the talk. I wanted to cover the reasonable stuff, how you run Llama 405B and some of the strategies people need to implement that aren't necessarily in the open source yet but are implemented at the labs, and also what the labs themselves are doing, because they're not worried about Llama-405B-class models.