A couple of different things. People have been talking about model stagnation, and I don't think anyone here really sees it that way, but a lot of that talk just comes down to the fact that we haven't seen a big capabilities leap recently.
But that's really because the models we're using today are largely the same as the models that were trained in 2022. GPT-4, GPT-4 Turbo, GPT-4o, those are just smaller models trained for longer, so similar quality. Claude 3.5 Sonnet came out recently, but again, that's actually smaller than Opus and yet somehow better, because they trained it for longer.
We haven't seen an extremely large model come out yet, but we will soon. One interesting thing: GPT-4 is something like 1.8 trillion parameters, which is crazy expensive to run, with around 200 billion parameters active per token, so each token requires almost 600 gigaflops. And yet, a year from now, that's almost going to be considered a last-generation model.
So there are a couple of things I wanted to talk about regarding that, mostly on the inference side, because I don't think anyone here is going to try to train that kind of next-generation model, but we definitely need to be able to run it.
So let me break down inference in detail. There are two parts to inference: pre-fill and decode. Pre-fill is the prompt processing, and the interesting thing is that if you have a 2K-context prompt, 2,000 tokens that you input into GPT-4, that's about a petaflop of compute by itself. And if you enter a 32,000-token prompt, it's actually around 20 petaflops. So it's an incredible amount of compute just to process the prompt. And while pre-fill is very compute-intensive, decode is actually the opposite. Decode is generating each token iteratively.
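As a rough sanity check of those prompt-processing numbers, here's a back-of-the-envelope sketch. It leans on the roughly 600-gigaflops-per-token figure from earlier, which is itself an estimate (about 2 FLOPs per active parameter per token), and it ignores the extra attention FLOPs at long context, so treat it as illustrative.

```python
# Back-of-the-envelope for prefill compute, using the ~600 GFLOPs-per-token
# estimate quoted earlier for a GPT-4-class model (an assumption, not a spec).
FLOPS_PER_TOKEN = 600e9   # assumed forward-pass FLOPs per token

for prompt_tokens in (2_000, 32_000):
    total = FLOPS_PER_TOKEN * prompt_tokens
    print(f"{prompt_tokens:>6}-token prompt -> ~{total / 1e15:.0f} PFLOPs of prefill compute")

# 2,000-token prompt  -> ~1 PFLOPs  (the "petaflop" figure)
# 32,000-token prompt -> ~19 PFLOPs (roughly the "20 petaflops" figure)
```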
So you process the prompt, then you generate a token, feed it back in, and keep going iteratively. And decode is extremely memory-bandwidth-intensive: you have to load the entire model, all of the weights, onto the chip, or chips, for every decode step.
And the big challenge here is that if you have 1.8 trillion parameters and you're running at a reasonable batch size, you're activating all the experts, so you need to load all 1.8 trillion parameters for every single token generation, even if you're serving multiple users at once. That means you need terabytes per second of memory bandwidth. If you want 30 tokens per second per user, which I think is the minimum bar for most people, and a lot of people want hundreds of tokens per second, then with 64 users you need something like 60 terabytes per second of memory bandwidth.
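Here's the same kind of sketch for the decode-bandwidth number. The 1-byte-per-parameter weight format is my assumption, and this ignores KV-cache traffic, which adds more on top; the key point is that all 64 users in the batch share a single pass over the weights on each decode step.

```python
# Back-of-the-envelope for decode memory bandwidth (assumptions: ~1 byte per
# parameter, e.g. 8-bit weights, all experts loaded each step, KV cache ignored).
PARAMS = 1.8e12                      # total parameters
WEIGHT_BYTES = PARAMS * 1            # ~1.8 TB streamed per decode step

TOKENS_PER_SECOND = 30               # per user; the whole batch shares each weight pass
required_bw = WEIGHT_BYTES * TOKENS_PER_SECOND
print(f"~{required_bw / 1e12:.0f} TB/s of aggregate memory bandwidth")       # ~54 TB/s

H100_HBM_BW = 3.35e12                # bytes per second on a single H100
print(f"that's ~{required_bw / H100_HBM_BW:.0f} H100s' worth of bandwidth")  # ~16
```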
Even an H100 only has about three. So this is an extremely challenging systems problem. And while decode is very bandwidth-intensive, it's actually quite cheap on compute, which is why if you look at OpenAI's or Claude's pricing, you see a three- or four-to-one ratio between pre-fill and decode pricing. The input tokens cost a third or a quarter of what the output tokens cost. Today the best models, GPT-4o and Claude 3.5 Sonnet, are, I want to say, $5 per million input tokens and $15 per million output tokens, so $5 for pre-fill and $15 for decode.
And soon, in open source, the thing everyone here can actually touch, we're going to have Llama 3 405B. That's going to be a real capability unlock for the open-source market as well as for the builders here.
And I think there are a couple of things people really need to be able to implement. You can't just run llama.cpp on Llama 405B; it's just not going to work. So there's a bunch of stuff people have to work on, whether that's using closed-source libraries like TensorRT-LLM that only work on NVIDIA, or vLLM, which is an open-source library that works on AMD and Intel, and soon other vendors' chips as well.
There's a lot people need to figure out. One of those things is continuous batching, because running inference at batch size one is horrendously expensive. It's fine if you're running on your own personal device, but if you're renting GPUs in the cloud and running at batch size one, you're going to cost yourself 10x more. And 10x is a low bar; it could actually be 10x to 100x more than running at a high batch size. So you have to figure out how to run at high batch sizes, meaning many concurrent users served at once. And one of the things that makes that difficult is that user requests come in at different times.
One person might send a request now, and another person sends a request five seconds later, while the first person's request isn't done yet. So you need continuous batching, i.e., the ability to bring new users into the batch on every iteration through the model. Continuous batching is one of the things you have to have support for, and a lot of software today, like llama.cpp, doesn't support it. So either you need to build it yourself or contribute to an open-source project that builds it, to enable low-cost inference for models like Llama 405B.
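To make that concrete, here's a minimal, illustrative sketch of a continuous-batching loop. This is not vLLM's or TensorRT-LLM's actual implementation: the `Request` class and `model_step` function are made-up stand-ins, and real engines also have to schedule prefill, manage paged KV caches, and so on.

```python
# Minimal continuous-batching sketch: new requests join the batch on every
# iteration instead of waiting for the current batch to finish.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list                           # prompt token ids
    max_new_tokens: int
    generated: list = field(default_factory=list)

def model_step(batch):
    """Stand-in for one decode pass: returns one new token per running request."""
    return [0 for _ in batch]              # placeholder token ids

waiting = deque()                          # arrived but not yet scheduled
running = []                               # requests currently in the batch
MAX_BATCH = 64

def engine_loop():
    while waiting or running:
        # The "continuous" part: admit new requests every iteration,
        # up to the batch limit, without draining the current batch.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())

        # One forward pass generates one token for every running request.
        for req, tok in zip(running, model_step(running)):
            req.generated.append(tok)

        # Finished requests leave immediately, freeing their batch slots.
        running[:] = [r for r in running if len(r.generated) < r.max_new_tokens]

waiting.extend([Request(prompt=[1, 2, 3], max_new_tokens=4),
                Request(prompt=[4, 5], max_new_tokens=2)])
engine_loop()
```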
Another one is disaggregated pre-fill, or disaggregated batching, depending on what you call it. Going back to earlier: pre-fill is very compute-intensive and decode is very bandwidth-intensive. These are two different workloads, but when you're serving a user, whether in your own app or through an API, the user doesn't care that it's two different workloads. To them it's one workload: I submit something and I get tokens back. But anyone running the infrastructure themselves needs to be keenly aware that they are two different workloads. So there's one thing a lot of people have started to do. Google has publicly said they're doing it,
I believe OpenAI and Anthropic are also doing it, and other firms like Together and Fireworks have hinted that they're doing it too: disaggregated pre-fill. Once your inference volumes are high enough, you don't just replicate the model across however many chips you have. Say it takes four chips to serve Llama 405B in the future. If you have enough users, you don't just go four, then eight, then 16, replicating that across the world. You actually do this thing called disaggregated pre-fill: one set of accelerators does the pre-fill, which is very compute-intensive, and then hands off to another set of accelerators to do the decode. Today everyone just uses the same accelerator type for both, H100 or A100, maybe L40, but mostly H100.
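Here's a rough sketch of the shape of that split. The toy "model" functions and the in-process queue are placeholders I made up so the structure is runnable; in a real deployment the prefill pool hands the KV cache to the decode pool over a fast interconnect between separate sets of accelerators.

```python
# Disaggregated prefill sketch: one worker pool does the compute-heavy prompt
# pass, another does the bandwidth-heavy token-by-token decode.
import queue
import threading

def prefill(prompt):
    """Placeholder for the compute-bound prompt pass: returns a fake KV cache
    and the first generated token."""
    return {"context_len": len(prompt)}, len(prompt) % 100

def decode_step(token, kv_cache):
    """Placeholder for one memory-bandwidth-bound decode step."""
    kv_cache["context_len"] += 1
    return token + 1, kv_cache

handoff = queue.Queue()                  # prefill pool -> decode pool

def prefill_worker(prompts):
    # Would run on the prefill accelerators.
    for prompt in prompts:
        handoff.put(prefill(prompt))
    handoff.put(None)                    # no more work

def decode_worker(max_new_tokens=8):
    # Would run on the decode accelerators.
    while (item := handoff.get()) is not None:
        kv_cache, token = item
        generated = [token]
        for _ in range(max_new_tokens - 1):
            token, kv_cache = decode_step(token, kv_cache)
            generated.append(token)
        print("generated:", generated)

prompts = [[1] * 2_000, [2] * 32_000]    # two incoming requests
threading.Thread(target=prefill_worker, args=(prompts,)).start()
decode_worker()
```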
But there's a reason you do this, and the big reason is noisy neighbors. If you've ever worked on CPUs or anything in cloud computing, noisy neighbors are a huge issue, and it's actually fairly trivial to dramatically slow down most inference providers' services. If you just send queries in a certain way, a sort of malicious way, you can slow down people's service, and that will impact other users' time to first token, which is a huge issue.
If time to first token is too long, people will just quit using your service. If the tokens per second varies a lot, for a moment you're getting 100 tokens per second, then it drops down to 30, then it goes back up to 100, that's going to be really annoying to the user. So there's a lot around SLAs and reliability that you have to guarantee, and disaggregated pre-fill is one of the techniques for doing that. You don't want someone to be able to say, for example, "hey, I have a database, I want to run an LLM query across every single row, and I'm just going to submit it all to you, my service provider, because you have this cool model fine-tuned on some dataset." If they submit 10,000 rows at once, that's going to kill everyone else's performance. So this is one of the techniques for making sure that that person, who you definitely do want to serve, doesn't impact everyone else's usage.
Because once you open up your service to the real world, you can't control who's submitting what, and rate limits are the most annoying thing ever, so that's not the right way to go about it. Another technique is context caching, which Google launched recently.
They're the only ones offering it today, but I think it's a really big deal. When people talk about fine-tuning models, that's great, but in reality the best models are really expensive or outright impossible to fine-tune. I can't go fine-tune 3.5 Sonnet, and fine-tuning Llama 405B is going to take dozens and dozens of GPUs.
So instead of that, and this applies to closed-source models generally, Google only offers their big models closed-source, and on Gemini 1.5 Pro they recently brought out this feature: context caching. Instead of fine-tuning your model, why not just fill out the context length, they offer, I think, two million tokens of context today, with your data? There are a couple of advantages to that. One is that you can use the best models: with fine-tuned models you're really stuck with things like Llama 7B or Mixtral or Llama 70B, which are much lower-quality models than what's available in the closed-source world. So one thing you can do is use what Google calls context caching, and in the open-source world we'll have super-long-context models soon enough. But think about the economics.
We talked about $15 per million output tokens and $5 per million input tokens. On the best closed-source models today, if you submit a prompt of, say, a million tokens, and most of the time you're looking at a document and getting a short answer back, your output is very small, so almost all of the cost is just sending them that document. That's really going to hurt you. So for people targeting something like legal AI, or some other contract-review AI, a lot of these enterprise use cases, pre-fill is going to dominate your cost if you're using APIs.
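A rough worked example with the prices quoted above; the one-million-token document and the 500-token answer are made-up numbers just to show the ratio.

```python
# Why prefill dominates the bill for long-document use cases.
INPUT_PRICE = 5 / 1e6        # dollars per input (prefill) token
OUTPUT_PRICE = 15 / 1e6      # dollars per output (decode) token

prompt_tokens = 1_000_000    # e.g. a huge contract or document set
output_tokens = 500          # a short answer back

prefill_cost = prompt_tokens * INPUT_PRICE     # $5.00
decode_cost = output_tokens * OUTPUT_PRICE     # ~$0.0075
print(f"prefill: ${prefill_cost:.2f}, decode: ${decode_cost:.4f}")
# Without caching, every query over the same document re-pays that ~$5 of prefill.
```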
So Google has context caching, open source will have it too for models you can run yourself, and others will deploy it over time. Basically, you don't recompute the KV cache, the processed context, every single time; instead you cache it. The problem is that saving it takes an incredible amount of memory, so you don't keep it in the GPU's memory; you save it in CPU memory or on storage. vLLM, the open-source inference library, is building this currently, so if you're interested in contributing, check that out, or if you're interested in using it, go star the project. Most of the models we can run ourselves today only have 4K, 8K, or 32K context lengths, but longer ones are coming, and being able to dramatically reduce your costs by caching the context is going to be a big deal.
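Here's a minimal sketch of the idea: compute the KV cache for a long, shared prefix once, keep it in a cheaper memory tier, and reuse it for later queries. This is an illustration of the concept only, not Google's or vLLM's actual API; the cache store and the placeholder prefill function are made up.

```python
# Prefix/context caching sketch: pay the big prefill once per shared document.
import hashlib

kv_store = {}                                   # stand-in for a CPU-RAM / disk tier

def prefix_key(prefix_tokens):
    return hashlib.sha256(bytes(t % 256 for t in prefix_tokens)).hexdigest()

def compute_kv(tokens):
    """Placeholder for the expensive prefill pass over `tokens`."""
    return {"num_tokens": len(tokens)}          # pretend this is the KV cache

def answer(prefix_tokens, question_tokens):
    key = prefix_key(prefix_tokens)
    if key not in kv_store:
        # First query against this document: pay the full prefill cost once.
        kv_store[key] = compute_kv(prefix_tokens)
    cached = kv_store[key]
    # Only the new question tokens need prefill now; decode proceeds as usual.
    delta = compute_kv(question_tokens)
    return {"context_tokens": cached["num_tokens"] + delta["num_tokens"]}

document = list(range(1_000_000))               # the long, shared context
print(answer(document, [1, 2, 3]))              # pays the big prefill once
print(answer(document, [4, 5, 6]))              # reuses the cached prefix
```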
So now I'm going to talk about head-in-the-clouds stuff instead of immediately usable things: what's coming down the pipeline. GPT-4 was trained on something like 20,000 chips for 90 to 100 days and used around 38 gigawatt-hours of energy. Very, very expensive. Cool. But what are they building now? OpenAI, xAI, Anthropic, and many others are building 100,000-chip clusters, which could train GPT-4 in about three days, so GPT-4-scale training is becoming kind of irrelevant. I'll skip over this part because it's not really too relevant.
But what is a modern system capable of? The H100 is pretty fast relative to the A100, and the new NVIDIA chips are coming down the pipeline. So what's coming out of these 100,000-GPU clusters?
It's not going to be a 1.8-trillion-parameter model; it could actually be in the tens of trillions of parameters. On training FLOPs: I talked about GPT-4 being roughly 2e25 FLOPs. But with a 100,000-GPU cluster you can do on the order of 1e26 to 1e27 FLOPs, and running the resulting model is going to require on the order of 200 terabytes per second of memory bandwidth.
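For a rough sense of where those cluster-level numbers come from, here's a back-of-the-envelope. The per-GPU peak, the utilization figure, and the 2e25 GPT-4 training number are all assumptions or widely repeated estimates, so this is an order-of-magnitude sketch, not a spec.

```python
# Back-of-the-envelope for a 100,000-GPU training cluster.
GPUS = 100_000
PEAK_FLOPS_PER_GPU = 2e15        # assumed FP8 peak for an H100-class chip
MFU = 0.35                       # assumed sustained model-FLOPs utilization

cluster_flops = GPUS * PEAK_FLOPS_PER_GPU * MFU          # ~7e19 FLOPS sustained
gpt4_train_flops = 2e25                                  # the figure quoted above

seconds = gpt4_train_flops / cluster_flops
print(f"~{seconds / 86_400:.1f} days to redo a GPT-4-scale run")   # ~3.3 days

# Over a ~100-day run the same cluster delivers roughly 6e26 FLOPs,
# squarely in the 1e26-1e27 range mentioned above.
print(f"~{cluster_flops * 100 * 86_400:.1e} FLOPs in 100 days")
```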
So what does that actually look like? On the top right is an image of Microsoft's data center in Arizona where they're training GPT-5. They have about a hundred thousand GPUs there, and it's 150 megawatts. The average home doesn't consume anything remotely like that; it's on the order of tens of thousands, if not hundreds of thousands, of homes' worth of power consumption, which is kind of insane. Elon has talked about his next-generation cluster. He's building a hundred-thousand-GPU cluster today, but he's said his next-generation cluster will be 300,000 GPUs. That is kind of insane; the power cost for that alone would be something like $500 million a year. People are kind of insane, but it's pretty cool.
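As a rough check on that power figure; the per-GPU draw (including server and cooling overhead) and the electricity prices are my assumptions.

```python
# Back-of-the-envelope for the yearly electricity bill of a 300,000-GPU cluster.
GPUS = 300_000
KW_PER_GPU = 1.4                          # assumed all-in power per GPU (kW)
HOURS_PER_YEAR = 8_760

energy_kwh = GPUS * KW_PER_GPU * HOURS_PER_YEAR          # ~3.7 billion kWh/year
for price in (0.10, 0.15):
    print(f"${energy_kwh * price / 1e6:,.0f}M/year at ${price:.2f}/kWh")
# ~$368M/year at $0.10/kWh, ~$552M/year at $0.15/kWh -- roughly the $500M figure.
```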
But the interesting thing on the training side is that when you train a model today, people just talk about fully connected clusters: every GPU is connected to every other GPU at some speed, and you run all your operations across that. That's not really possible when you go to these super-large clusters. The hundred-thousand-GPU clusters are being built this year, and next year they're planning multiple-hundred-thousand-GPU clusters. Already a single cluster spans multiple buildings, so there's a lot of complicated networking going on to connect these data centers together.
And one other thing I think is interesting, again head-in-the-clouds stuff, is that when you connect these chips together there's a lot of optics: you convert from electrical to optical and go over fiber to connect between chips, transceivers, and so on. These are extremely unreliable; they tend to have a mean time to failure of around five years. So what's interesting is that whether you're talking about a hundred-thousand-GPU cluster or a 500,000-GPU cluster, you're going to have something fail something like every five minutes.
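Here's a rough illustration of why that happens. The transceiver count per GPU is my assumption, and I'm treating failures as independent with a five-year mean time to failure.

```python
# Why optical failures become near-constant at cluster scale.
MTTF_YEARS = 5
MINUTES_PER_YEAR = 365 * 24 * 60

for gpus in (100_000, 500_000):
    transceivers = gpus * 5                  # assumed transceivers per GPU in the fabric
    failures_per_minute = transceivers / (MTTF_YEARS * MINUTES_PER_YEAR)
    print(f"{gpus:,} GPUs -> a failure every ~{1 / failures_per_minute:.1f} minutes")
# 100,000 GPUs -> a failure every ~5.3 minutes
# 500,000 GPUs -> a failure every ~1.1 minutes
```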
Which is insane: how do you even deal with something in your cluster failing every five minutes while you're training a model? So this is again more of a hardware-oriented thing, but the other interesting thing is that when you get chips, they're not all the same speed. An H100 is not an H100: there are stragglers. If you get a large distribution of chips, you run into what the industry calls the silicon lottery. You can buy a gaming GPU, compare it to other people's gaming GPUs on the forums, and there are actual percentage-level differences in performance.
But when you do a massive training run, you have to remember that training is a synchronous workload: you run through a bunch of data, pass the gradients around, update the weights, then run through more data, pass the gradients around, update the weights again. So because it's synchronous, if one GPU is 10% slower, everything is 10% slower. And ByteDance had a cool paper where they actually saw a 25% decrease in speed just because of one random GPU they got; while it did technically work, and according to NVIDIA it was fine,
it was something like 25% slower than what they wanted. And that was on a roughly 20,000-GPU cluster, so it's quite interesting that these are the problems people are running into at scale. They pulled that GPU out, and you can see their training performance dramatically improved. Again, that's ByteDance on a 20,000-GPU cluster, so it's a big, big issue. I think some of the other stuff in this presentation isn't really relevant here.
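Here's a tiny sketch of why a single straggler like that drags a whole synchronous run down; the numbers are made up for illustration and this isn't ByteDance's actual setup.

```python
# Toy illustration of the straggler effect in synchronous data-parallel training.
step_time_per_gpu = [1.00] * 19_999 + [1.25]   # one GPU ~25% slower than the rest

# Each step (forward/backward plus the gradient all-reduce) only finishes when
# the slowest worker finishes, so the whole cluster moves at the straggler's pace.
step_time = max(step_time_per_gpu)
print(f"cluster step time: {step_time:.2f}x baseline")            # 1.25x
print(f"effective throughput: {1 / step_time:.0%} of the ideal")  # 80%
```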
But I think what these next-generation systems will look like is a very important question to ask yourself, along with what you do when you have to deal with that. A lot of the scaffolding people are building for LLMs today is dealing with hallucinations and things like that.
And the hope that everyone has, or at least that a lot of the AGI people have, is that when you 100x the compute, when you build a cluster that costs $500 million a year just in electricity, and over $10 billion for the cluster itself, by the way, and train a model with it, it's going to get rid of a lot of those hallucinations and let us do a lot of interesting things. Yeah, so I think that's basically all for the talk.
I just wanted to mention the reasonable stuff, which is how you run Llama 405B, some strategies people need to implement that aren't necessarily implemented yet in open source but are implemented at the labs. But then also what the labs themselves are doing, because they're not worried about Llama 405B-class models.