System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis

A couple of different things. People have been talking about stagnation, and I don't know if anyone here sees it that way, but a lot of people have been talking about stagnation of models, and a lot of that just has to do with the fact that we haven't seen a big capabilities leap recently. That's really because the models we're using today are largely the same as the models that were trained in 2022. GPT-4, GPT-4 Turbo, GPT-4o: those are just smaller models trained for longer, so similar quality. Claude 3.5 Sonnet came out recently, but again, that's actually smaller than Opus and somehow better, because they trained it for longer. We haven't seen an extremely large model come out yet, but we will soon. One interesting thing is that GPT-4 is reportedly around 1.8 trillion parameters, and it's crazy expensive to run: a couple hundred billion of those parameters are active for each token, so every token requires almost 600 gigaflops. And yet a year from now that's almost going to be considered a last-generation model. So there are a couple of things I wanted to talk about regarding that, mostly on the inference side, because I don't think anyone here is going to try to train that kind of next-generation model, but we definitely need to be able to run it.
So let me break down inference in detail. There are two parts to inference: prefill and decode. Prefill is the prompt processing, and the interesting thing is that a 2K-context prompt, 2,000 tokens that you input into GPT, is about a petaflop of compute all by itself; a 32,000-token prompt is around 20 petaflops. So it takes an incredible amount of compute just to process the prompt. And while prefill is very compute-intensive, decode is the opposite. Decode is generating each token iteratively: you process the prompt, generate a token, feed it back in, and keep going. Decode is extremely memory-bandwidth-intensive, because you have to load the entire model, all the weights, into the chip or chips for every decode step. The big challenge is that with 1.8 trillion parameters, at any reasonable batch size you're activating all the experts, so you need to load all 1.8 trillion parameters for every single token-generation step, even if you're serving multiple users at once. That means you need terabytes a second of memory bandwidth just for the weights. If you want 30 tokens per second per user, which I think is the minimum bar for most people (a lot of people want hundreds of tokens per second), and you're serving 64 users, you need something like 60 terabytes a second of memory bandwidth. An H100 has about three. So this is an extremely challenging systems problem.
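To put rough numbers on that, here is a quick back-of-the-envelope sketch. The parameter counts are the rumored GPT-4 figures from above, not official numbers, and the rule of roughly 2 FLOPs per active parameter per token plus int8 weights are simplifying assumptions:

```python
# Back-of-the-envelope math for the figures above. Model sizes are the
# rumored GPT-4 numbers (not official); 2 FLOPs per active parameter per
# token and int8 weights are simplifying assumptions.

TOTAL_PARAMS = 1.8e12      # assumed total parameters (MoE)
ACTIVE_PARAMS = 280e9      # assumed parameters active per token
BYTES_PER_PARAM = 1        # int8/fp8 weights; use 2 for fp16

# Decode compute: ~2 FLOPs per active parameter per generated token.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per decoded token")

# Prefill: every prompt token pays roughly the same matmul cost.
for prompt_len in (2_000, 32_000):
    print(f"prefill of {prompt_len} tokens ~ "
          f"{flops_per_token * prompt_len / 1e15:.1f} PFLOPs")

# Decode bandwidth: at a reasonable batch size every expert is hit, so each
# generation step streams the full weight set through the chips. The batch
# (e.g. 64 users) shares that load, but 30 steps per second still means:
weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
print(f"~{weight_bytes * 30 / 1e12:.0f} TB/s of memory bandwidth "
      f"(a single H100 has ~3.35 TB/s)")
```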
Moreover, while decode is very bandwidth-intensive, it's actually quite cheap on compute, which is why if you look at OpenAI's pricing or other cloud pricing, you see a three- or four-to-one ratio between prefill and decode pricing: the input tokens cost a third or a quarter of what the output tokens do. Today the best models, GPT-4o and Claude 3.5 Sonnet, are, I want to say, $5 per million tokens for input and $15 per million tokens for output, so $5 for prefill and $15 for decode. And soon, in the open source, what everyone here can actually touch, we're going to have Llama 3 405B, and that's going to be a real capability unlock for the open-source market as well as for the builders here. But there are a couple of things people really need to be able to implement. You can't just run llama.cpp on Llama 405B; it's just not going to work. There's a bunch of stuff people have to work on, whether that's using NVIDIA's TensorRT-LLM, which only works on NVIDIA hardware, or vLLM, an open-source library that works on AMD and Intel and soon other people's chips as well. There's a lot of stuff people need to figure out.
One of those things is continuous batching. Running inference at batch size one is horrendously expensive. It's fine if you're running on your own personal device, but if you're running in the cloud, renting GPUs, batch size one is going to cost you 10x more, and 10x is a low bar; it could be 10x to 100x more than running at a high batch size. So you have to figure out how to run at high batch sizes, meaning many concurrent users served at once. One of the things that makes that difficult is that user requests come in at different times: one person might send a request now, another person sends one five seconds later, and the first person's request isn't done yet. So you need continuous batching, meaning you run through the model iteratively, step by step, and bring new users into the batch as they arrive. Continuous batching is something you have to have support for, and a lot of software today, like llama.cpp, doesn't support it. So either you build it yourself or you contribute to an open-source project that builds it, to enable low-cost inference for models like Llama 405B.
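Here is a toy sketch of the idea, not vLLM's or TensorRT-LLM's actual implementation; the `engine.prefill` and `engine.decode_step` hooks are hypothetical stand-ins for a real inference backend:

```python
# Toy continuous batching: instead of waiting for a whole batch to finish,
# the scheduler re-forms the batch every decode step, admitting new requests
# and retiring finished ones. The `engine` object is a hypothetical backend.
from collections import deque

class Request:
    def __init__(self, req_id, prompt_tokens, max_new_tokens):
        self.req_id = req_id
        self.tokens = list(prompt_tokens)   # grows by one token per decode step
        self.remaining = max_new_tokens

def serve(engine, waiting: deque, max_batch: int = 64):
    running = []
    while waiting or running:
        # Admit new requests whenever a slot is free; no waiting for the
        # current batch to drain, which is the whole point.
        while waiting and len(running) < max_batch:
            req = waiting.popleft()
            engine.prefill(req)              # build this request's KV cache
            running.append(req)
        engine.decode_step(running)          # one pass appends one token per request
        for req in running:
            req.remaining -= 1
        # Retire requests that hit their token budget or emitted end-of-sequence.
        running = [r for r in running
                   if r.remaining > 0 and r.tokens[-1] != engine.eos_token]
```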
Another one of those is disaggregated prefill, or disaggregated batching, depending on what you want to call it. Going back to earlier: prefill is very compute-intensive and decode is very bandwidth-intensive. These are two different workloads, but when you're serving a user, whether in your own app or through an API, the user doesn't care that it's two different workloads. To them it's one workload: I submit something and I get tokens back. But anyone running the infrastructure themselves needs to be keenly aware that these are two different workloads. So one thing a lot of people have started to do, and Google has publicly said they're doing it, I believe OpenAI and Anthropic are doing it too, and firms like Together and Fireworks have hinted that they're doing it, is disaggregated prefill. Once your inference volumes are high enough, you don't just replicate the model across however many chips you have. Say it takes four chips to serve Llama 405B in the future: if you have enough users, you don't just go four, then eight, then 16, replicating that across the world. Instead you have one set of accelerators do the prefill, which is very compute-intensive, and then hand off to another set of accelerators to do the decode. Today everyone uses the same accelerator type for both, an H100 or an A100, maybe an L40, but mostly H100s.
And there's a big reason you do this: noisy neighbors. If you've ever worked on CPUs or anything in cloud computing, noisy neighbors are a huge issue. It's actually trivial to dramatically slow down most inference providers' services: if you send queries in a certain way, in a sort of malicious way, you can just slow down people's service. That impacts users' time to first token, which is a huge issue, because if time to first token is too long, people will just quit using your service. And if the tokens per second varies a lot, for a moment you're getting 100 tokens per second, then it drops to 30, then it goes back up to 100, that's really annoying to the user. So there are a lot of things around SLAs and reliability that you have to guarantee, and disaggregated prefill is one of the techniques for doing that. For example, say someone has a database and wants to run an LLM query across every single row, and they submit the whole thing to you, their service provider, because you have this cool model fine-tuned on some dataset. If they submit 10,000 rows at once, that's going to kill everyone else's performance. So this is one of the techniques for making sure that the customer you definitely want to serve doesn't wreck everyone else's experience, because once you open your service up to the real world, you can't control who submits what, and rate limits are the most annoying thing ever; that's not the right way to go about it.
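Here is a minimal sketch of the routing idea, assuming two worker pools with a queue between them; it illustrates the concept rather than any provider's actual architecture, and every name in it is made up:

```python
# Toy disaggregated prefill: one pool of workers handles the compute-heavy
# prefill, a separate pool handles the bandwidth-heavy decode. The KV-cache
# handoff here is an in-process queue; real systems move it over the network
# or shared storage. The engine objects are hypothetical backends.
import queue
import threading

prefill_jobs = queue.Queue()   # incoming requests with raw prompts
decode_jobs = queue.Queue()    # (request, kv_cache) pairs ready to decode

def prefill_worker(engine):
    while True:
        request = prefill_jobs.get()
        kv_cache = engine.prefill(request.prompt)   # compute-bound step
        decode_jobs.put((request, kv_cache))        # hand off to the decode pool

def decode_worker(engine):
    while True:
        request, kv_cache = decode_jobs.get()
        for token in engine.generate(kv_cache):     # bandwidth-bound step
            request.stream(token)

def start(prefill_engines, decode_engines):
    # The two pools are sized independently: long-prompt traffic grows the
    # prefill pool, chatty short-prompt traffic grows the decode pool, and a
    # single giant prompt only ties up prefill workers, not decode latency.
    for eng in prefill_engines:
        threading.Thread(target=prefill_worker, args=(eng,), daemon=True).start()
    for eng in decode_engines:
        threading.Thread(target=decode_worker, args=(eng,), daemon=True).start()
```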
Another thing is context caching. Google launched this recently, and they're the only ones offering it today, but I think it's a really big deal. When people talk about fine-tuning models, that's great, but in reality the best models are really expensive or outright impossible to fine-tune. I can't go fine-tune Claude 3.5 Sonnet, and fine-tuning Llama 405B is going to take dozens and dozens of GPUs. The same goes for closed-source models generally, and Google mostly only offers closed models at the top end. So for Gemini 1.5 Pro they recently brought out context caching: instead of fine-tuning your model, why not just fill out the context length, and they offer I think two million tokens of context today, with your data? There are a couple of advantages to that. One is that you can use the best models: with fine-tuning you're really focused on things like Llama 7B, Mixtral, or Llama 70B, which are much lower-quality models than what's available in the closed-source world. So one thing you can do is use what Google calls context caching, and in the open-source world we'll have super-long-context models soon enough. But think about the economics. We talked about $15 per million output tokens and $5 per million input tokens. On the best closed-source models today, if you submit a prompt of, say, a million tokens, and most of the time you're looking at a document and getting a short answer back, your output is tiny, so almost all of the cost is just sending them that document. That's really going to hurt. So for people targeting something like legal AI or contract-review AI, a lot of these enterprise use cases, prefill is going to dominate your cost if you're using APIs.
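As a quick illustration of how lopsided that gets, using the list prices quoted above and a made-up 500-token answer:

```python
# Cost split for one long-document query at the prices quoted above
# ($5 per million input tokens, $15 per million output tokens). The
# 1M-token document and 500-token answer are illustrative numbers.
input_tokens, output_tokens = 1_000_000, 500
input_cost = input_tokens / 1e6 * 5.00      # prefill
output_cost = output_tokens / 1e6 * 15.00   # decode
total = input_cost + output_cost
print(f"prefill ${input_cost:.2f} + decode ${output_cost:.4f} = ${total:.2f}")
print(f"prefill share of the bill: {input_cost / total:.1%}")
```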
So Google has context caching, and open source will have it too, for models you can run yourself and that others will deploy over time. Basically, you don't recompute the KV cache, the processed context, every single time; you cache it instead. The problem is that saving it takes an incredible amount of memory, so you don't keep it in the GPU's memory, you keep it in CPU memory or storage. vLLM, the open-source inference library, is building this right now, so if you're interested in contributing, check that out, and if you're interested in using it, go star the project. Most of the models we have in the open source today are only something like 4K, 8K, or 32K context length, but longer ones are coming, and being able to dramatically reduce your costs by caching the context is going to matter a lot.
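Here is a minimal sketch of the idea, assuming a hypothetical backend with `compute_kv` and `decode` hooks; real implementations, like vLLM's prefix caching or Gemini's context caching, manage memory tiers and eviction much more carefully:

```python
# Toy prefix/context cache: hash the shared document prefix, compute its
# KV cache once, keep it in host (CPU) memory, and reuse it for every query
# against that document. `backend` is a hypothetical inference engine.
import hashlib

kv_store = {}   # prefix hash -> KV cache held in CPU RAM or storage

def cached_query(backend, document_tokens, question_tokens):
    key = hashlib.sha256(str(document_tokens).encode()).hexdigest()
    if key not in kv_store:
        # Pay the expensive prefill for the document exactly once.
        kv_store[key] = backend.compute_kv(document_tokens)
    doc_kv = kv_store[key]
    # Per request, only the short question is prefilled; decoding continues
    # from the cached document KV plus the question's KV.
    return backend.decode(doc_kv, question_tokens)
```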
Now I'm going to talk about more head-in-the-clouds stuff instead of immediately usable things: what's coming down the pipeline. GPT-4 was something like 20,000 chips for 90 to 100 days and used around 38 gigawatt-hours. Very expensive, very cool. But what are they building now? OpenAI, xAI, Anthropic, and many others are building 100,000-chip clusters, and a cluster like that would train GPT-4 in about three days, which makes that scale kind of irrelevant.
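A rough sanity check on those numbers; the peak throughputs are spec-sheet figures, while the ~35% utilization (MFU) and the assumption of FP8 training on H100s are mine:

```python
# Sanity check on the training numbers. Peak throughputs are spec-sheet
# values; ~35% model FLOPs utilization (MFU) and FP8 training on the H100
# cluster are assumptions.
GPT4_FLOPS = 2e25           # rumored GPT-4 training compute

A100_BF16_PEAK = 312e12     # FLOPs/s, dense
H100_FP8_PEAK = 1979e12     # FLOPs/s, dense
MFU = 0.35

def train_days(n_gpus, peak_flops):
    return GPT4_FLOPS / (n_gpus * peak_flops * MFU) / 86_400

print(f"20k A100s:  {train_days(20_000, A100_BF16_PEAK):.0f} days")   # ~100 days
print(f"100k H100s: {train_days(100_000, H100_FP8_PEAK):.1f} days")   # ~3 days
```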
I'll skip over the next part because it's not really too relevant, but the question is what a modern system is capable of. The H100 is pretty fast relative to the A100, and the new NVIDIA chips are coming down the pipeline. So what's coming out of these 100,000-GPU clusters? It's not going to be a 1.8-trillion-parameter model; it could be in the tens of trillions of parameters. On training FLOPs, I mentioned GPT-4 is roughly 2e25 FLOPs, which is a number that doesn't mean much on its own, but with a 100,000-GPU cluster you can do on the order of 1e26 to 1e27 FLOPs, and running a model like that is going to require something like 200 terabytes a second of memory bandwidth.
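Another quick sanity check, with the caveat that the next-generation model size is pure speculation; the 10-trillion-parameter figure, int8 weights, 20 to 30 tokens per second, and the 120-day run length are all assumptions:

```python
# Speculative sizing for a next-generation model; nothing here is a known spec.
PARAMS = 10e12            # assumed "tens of trillions" class model
BYTES_PER_PARAM = 1       # int8 weights

for tok_per_s in (20, 30):
    bandwidth = PARAMS * BYTES_PER_PARAM * tok_per_s / 1e12
    print(f"{tok_per_s} tok/s -> ~{bandwidth:.0f} TB/s of aggregate memory bandwidth")

# Training budget: 100k H100s at FP8, ~35% MFU, over an assumed ~120-day run
# lands in the e26-e27 range mentioned above.
print(f"~{100_000 * 1979e12 * 0.35 * 120 * 86_400:.1e} FLOPs")
```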
But what does that actually look like? The image on the top right is Microsoft's data centers in Arizona, where they're training GPT-5. They have about a hundred thousand GPUs there, and it's around 150 megawatts. The average home doesn't consume anything like that; it's the power consumption of tens of thousands, if not hundreds of thousands, of homes, which is kind of insane. Elon has talked about his next-generation cluster: he's building a hundred-thousand-GPU cluster today, but he's said the next one will be 300,000 GPUs. That's kind of insane too, because the power cost for that alone would be something like $500 million a year. People are kind of crazy, but it's pretty cool.
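Rough arithmetic behind that power bill; the all-in draw per GPU (server, networking, and cooling overhead) and the electricity price are assumptions on my part:

```python
# Back-of-the-envelope power cost for a 300k-GPU cluster. The ~1.5 kW all-in
# draw per GPU (server, networking, cooling overhead) and $0.12/kWh delivered
# electricity price are assumptions, not known figures for any real site.
N_GPUS = 300_000
KW_PER_GPU = 1.5
USD_PER_KWH = 0.12
HOURS_PER_YEAR = 8_760

site_mw = N_GPUS * KW_PER_GPU / 1_000
annual_cost = N_GPUS * KW_PER_GPU * HOURS_PER_YEAR * USD_PER_KWH
print(f"~{site_mw:.0f} MW site, ~${annual_cost / 1e6:.0f}M per year in electricity")
# -> roughly half a billion dollars a year, in line with the figure above.
```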
The interesting thing on the training side is that when you train a model today, people just talk about fully connected clusters: every GPU is connected to every other GPU at some speed, and you do all your collective operations over that. That's not really possible when you go to these super-large clusters. The hundred-thousand-GPU clusters are being built this year, and next year they're planning to build multiple hundred-thousand-GPU clusters. Already a single cluster can span multiple buildings, so there's a lot of complicated networking going on to connect these data centers together. One other thing that's interesting to think about, again head-in-the-clouds, is that when you connect these chips together there's a lot of optics: you convert from electrical to optical, go over fiber between chips, through transceivers, and so on. Those transceivers are extremely unreliable; they tend to have a mean time to failure of around five years. So if you're talking about a hundred-thousand-GPU cluster, or a 500,000-GPU cluster, you're going to have something fail roughly every five minutes. That's insane. How do you even deal with something in your cluster failing every five minutes while you're training a model? That's again more of a hardware-oriented problem.
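A rough sketch of the arithmetic behind the every-five-minutes claim; the five-year figure is the ballpark MTBF mentioned above, and the link counts are illustrative (a 100,000-GPU cluster typically has several optical links per GPU, so the fleet quickly reaches hundreds of thousands of transceivers):

```python
# Fleet-wide failure rate from a per-part MTBF: with n independent
# transceivers, the mean time between failures scales as MTBF / n.
# The 5-year MTBF is a ballpark, and the link counts are illustrative.
MTBF_MINUTES = 5 * 365 * 24 * 60    # ~5-year MTBF per transceiver, in minutes

for n_links in (100_000, 500_000):
    print(f"{n_links:,} transceivers -> one failure roughly every "
          f"{MTBF_MINUTES / n_links:.1f} minutes")
```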
The other thing that's interesting is that when you get chips, they're not all the same speed. An H100 is not an H100; there are stragglers. When you get a large distribution of chips, you run into what the industry calls the silicon lottery: you can buy a gaming GPU, compare it to other people's gaming GPUs on the forums, and there are genuinely percentage-level differences in performance. The problem for a massive training cluster is that training is a synchronous workload: you run through a bunch of data, pass the gradients around, update the weights, and repeat, over and over. Because it's synchronous, if one GPU is 10% slower, then everything is 10% slower. ByteDance had a cool paper where they saw roughly a 25% decrease in speed just because of one random GPU that technically worked, and according to NVIDIA was fine, but was about 25% slower than it should have been, and that was on a 20,000-GPU cluster. These are the problems people run into at scale. They pulled that GPU out, and you can see their performance dramatically uplift during training. Again, that was ByteDance on a 20,000-GPU cluster, so it's a big, big issue.
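A toy illustration of why one slow chip gates the whole job in synchronous training; the straggler position and the 25% slowdown are made-up numbers echoing the anecdote above:

```python
# In synchronous data parallelism the gradient all-reduce waits for every
# worker, so a step can only go as fast as the slowest GPU in the job.
nominal_step = 1.0                      # per-GPU step time, arbitrary units
n_gpus = 20_000

per_gpu = [nominal_step] * n_gpus
per_gpu[1234] = nominal_step * 1.25     # one "silicon lottery" loser, 25% slower

effective_step = max(per_gpu)           # everyone waits for the straggler
print(f"cluster step time: {effective_step:.2f}x nominal")   # -> 1.25x
```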
I think some of the other stuff in this presentation isn't really relevant here, but what these next-generation systems look like is a very important question to ask yourself, along with what you do about it. A lot of the scaffolding people are building for LLMs today deals with hallucinations and things like that, and the hope that a lot of the AGI people have is that when you 100x the compute, when you build a cluster that costs $500 million a year just in electricity, and over $10 billion for the cluster itself, and you train a model with it, it's going to get rid of a lot of those hallucinations and let us do a lot of interesting things. So that's basically all for the talk. I wanted to cover the reasonable stuff, how you run Llama 405B and some of the strategies people need to implement that aren't necessarily in the open source yet but are implemented at the labs, and also what the labs themselves are doing, because they're not worried about Llama-405B-class models.