
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou


Transcript

- - It's very difficult to teach extremely technical material in about 20 minutes. Initially, I had planned for at least a 45-minute session, so I left some reading material for you at the end. And all of the resources, you could download slides and everything, so feel free to take screenshots or not.

And so I work at NVIDIA, I'm a solutions architect, so I work primarily with retail clients, and it's my job to essentially work with those clients and understand what their main challenges are, whether that's data processing, computer vision, or any of the different use cases, and now I'm focused on LLM inference.

So my hope today is that you get a better intuition of exactly what's happening with this particular workload and how you go about, to some degree, sizing things, choosing different GPUs, et cetera, and more importantly, controlling the cost of a deployment, 'cause that overall cost is oftentimes the thing that's really going to prevent you from taking this to any meaningful scale.

Most folks that I've seen are doing some kind of hybrid, so you choose a big box API, you have some set of queries that go there. In addition, you have some set of queries that go to some open-source hosted model or some fine-tuned model that you have internally. So just for reference, if you go to build.nvidia.com or ai.nvidia.com, everyone can get 1,000 inference requests for free.

So I typically recommend this to folks who are benchmarking different types of open-source models. We have all of those models hosted. It's optimized. If you're teaching a course and you are trying to evaluate all of the different LLMs that are out there for your business, there are also multi-modal LLMs, speech LLMs.

Every model that NVIDIA accelerates will be available there for you. And that's sort of a path for you to either go optimize them yourselves or to work with us. You'll see things about NVIDIA inference microservice and all of those things that you can take to enterprise. So sometimes there's what I'll call the rocky road, and then there's the smooth road.

Whatever path you want to take, we're here to support you. In terms of agenda, very simple. I want you to understand the LLM inference workload, and then we'll move to how you go about measuring a production deployment and some of the things you need to be watching. It's a little more than, let's say, you know, the total time to generation and really understanding what's happening on the GPUs as you sort of scale out.

Even if you have a single GPU, I think it's very important for you to just have that intuition. And then lastly, I'll show you some software that you can use, some open source packages that you can use, and then point to some paid offerings. Okay, we're going to get into the LLM inference workload itself.

So the first part is really understanding what happens when you send a prompt onto the GPU. So I have this example here. I'm saying, okay, write me a presentation so I sound smart. I come to the AI engineer conference, and you guys are maybe going to like the talk.

And essentially what I'm going to do is I'm going to put that on the GPU. So the moment that I send that prompt on the GPU, it stays on the GPU. So think about that. And then from there, I'm going to generate one token at a time. So I'm generating the tokens, LLM inference is hard, and I put the timestamps T1 through T4.

So in every single deployment, no matter how fast anyone claims they're doing things, it's typically one token that's generated at a time. That's very important to understand. The next thing is that in order for an LLM to give you a coherent answer, just like how you speak, you have to remember every single thing that you said before, and you'll see the mechanism of how LLMs are able to do that.

So that's why I'm putting "LLM inference is" in red and putting that back onto the GPU. So every token that I generate gets stored back onto the GPU, and then you'll actually see what that looks like in terms of vectors. How many of you have heard of the KV cache before? Okay, some of you.

Typically, I don't see many leaders who have heard about this thing called the KV cache. The KV cache is the thing that, to some degree, really drives the cost. So whether you use some big box API or you're using a single GPU, it's all the same sort of mechanism, the same algorithm that everyone is trying to solve.

So in terms of steps, here I'd like to, as I said, sharpen your intuition. So the first thing, if we move from the left, my first job is to convert this text, whatever text you send (we're going to focus on LLM inference), into some words that the model understands.

So the model will have its own vocabulary, and it's my job to translate that. I'll give you the technical terms coming up after that. And the first thing that happens is I do some initial prompt processing. So I have to compute the attention mechanism on the entire prompt. I repeat that.

I have to compute the attention mechanism on the entire prompt per user. So if I have a million people hitting my service and a million people send 10,000 tokens, that's a million times 10,000 attention mechanisms that I need to compute, also while generating tokens for other people. So it's good for you to appreciate sort of that complexity that's happening.

And once I finish processing that prompt, then I'm going to start generating one token at a time. And that typically happens very fast. And then from there, every token that gets generated, that's in the LLM's vocabulary, I need to now de-tokenize that back into your language. So here's the technical terms that you'll see when you read the literature, or you read super technical documents.

First is tokenization. Each model will have its own tokenizer. And the thing to think about, when you think of tokenizers: when they did pre-training, they downloaded the internet and then some, right? And they cleaned it up, et cetera. So for the tokenizer, as you start thinking of the complexity across languages, coding languages, regions, et cetera, they tried to get the minimal set of character groups that can represent that entire training data set efficiently.

Because it's really all about efficiency. So for instance, the Llama tokenizer has 128,000 tokens, all right? And I'll talk a bit more about that. So here's what it actually kind of looks like on a GPU. So I tokenize, the LLM understands it. I go into this thing called pre-fill.

Pre-fill is the stage where you compute the attention mechanism. And many people are doing advancements with attention mechanisms. I'll talk a bit more about that. So there are tons of different schemes. People leverage all of the different types of memory hierarchies in GPUs to really accelerate this type of workload.

And then I start generating tokens one at a time. The red and the green just signify, hey, I'm storing those tokens on the GPU. The green is the latest one that I sent out. So hopefully that makes sort of intuitive sense. All right. So the other thing I want you to visualize, I think it's nice to visualize what is the actual data that sits on the GPU.

So first, a token is approximately four characters. That's a nice way for you to think about it. So from here, I have two vectors. So the first vector is just showing token one through token V. V is the total number of tokens that I have in my tokenizer. And the second vector below is just I have some numeric index.

I don't want to keep using the token to reference itself. I just use the number as a lookup. So my job when a prompt comes in is to convert that text into those tokens, what I'm going to call token IDs. So I have "make me sound smart" and "make me sound smarter."

You see two vector sets of tokens. And the key thing I want you to walk away with from that distinction is that an LLM token is not a human word. Sometimes it is. Sometimes it's not. It's typically some sub-part of a word. You'll see weird symbols when you look at tokenizers from different models.
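As a rough, hedged sketch of that text-to-token-ID step: this assumes the Hugging Face transformers package and uses the GPT-2 tokenizer as a freely downloadable stand-in, since the exact IDs and pieces depend on each model's own tokenizer.

```python
# A minimal sketch of text -> token IDs -> text. Assumes the Hugging Face
# "transformers" package; GPT-2's tokenizer is a stand-in for a Llama-style one.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "make me sound smart"
token_ids = tok.encode(prompt)                 # text -> list of integer token IDs
pieces = tok.convert_ids_to_tokens(token_ids)  # the sub-word pieces behind those IDs

print(token_ids)              # a short list of integers (exact values depend on the tokenizer)
print(pieces)                 # note: pieces are sub-words, not necessarily human words
print(tok.decode(token_ids))  # de-tokenization: back to "make me sound smart"
```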

But you want that first framing. So now we have text. We convert it to a vector. So from there, each one of those LLM tokens has a corresponding embedding vector. And embedding vectors are everywhere. We embed videos. We embed images. We embed text tokens. Think of it as a representation that an LLM can use to compare things and do math on.

So that's why we always want to convert into some vector representation. Because some vector representation is just some high dimensional coordinate space. And we're just rearranging objects. That's to some degree what you're doing. Okay. So from those token IDs, I went to the actual embedding vectors themselves. So if you look, make me sound smart, now becomes a matrix.

All right. Make me sound smarter becomes a matrix with an extra column. So in reality, what you're doing every time you submit a prompt, I don't care what LLM you submit it to, who you submit it to, this is what you're doing. All right. You are converting your text, now images as well.

They get converted to image tokens or something like that. That'll be another interesting talk to do, diffusion models, et cetera. But you're really putting this large matrix on the GPU. So the next question you should ask is, okay, why are GPUs good for this workload? Because they process matrices really, really fast.
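To make that concrete, here's a tiny illustrative sketch in PyTorch; the vocabulary size, hidden size, and token IDs are made-up numbers, not those of any real model.

```python
# Toy sketch: token IDs -> a matrix of embedding vectors. All sizes and IDs
# here are invented for illustration.
import torch

vocab_size, hidden_size = 128_000, 4096
embedding = torch.nn.Embedding(vocab_size, hidden_size)

token_ids = torch.tensor([[101, 2203, 3099, 777]])  # "make me sound smart" as hypothetical IDs
x = embedding(token_ids)                             # shape: (1, 4, 4096) -- the prompt is now a matrix
print(x.shape)
# In a real deployment this matrix is what sits on the GPU, e.g. x = x.to("cuda")
```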

So that's sort of the advantage, and hopefully that makes a lot more sense to you. Now, the next thing I want to talk about is how the LLM is going to process these tokens. And keep in mind, well, I'm not even going to ask you to raise your hand.

I'm 100% sure each of you has used an LLM. If you have not, I'm not sure what's happening. The other one is the attention mechanism. I truly think it's one of the things that you should understand. If we ever drift away from it, that's fine. But the fundamentals of that mechanism and seeing sort of the innovations around that, I think, can help anyone, any business leader, et cetera, et cetera, just because you are able to speak a different kind of language in this generative future.

So as you think of the attention mechanism, the intuition that you should have is just that mechanism of relating tokens. How do I distinguish in a sentence what is important? And then for the next token that's going to be generated, hey, what tokens that I said before were really important for me to make a next good decision for that next token?

So that's the intuition. And now we're going to -- we won't necessarily touch too much of the math, but I want you to see sort of what's happening on the GPU. So once again, the prompt comes in. I'm just going to do a short one, make me sound smart.

I'm going to generate this token called LLM, all right? We saw these same matrices that I said before. So remember, my text now turns into a matrix hitting onto the GPU. And the main thing I want you to understand or visualize here is actually how an LLM memory works.

So now when you're speaking, you've recorded everything that I've said for the last 10 minutes in your brain, somewhere it's stored. So now you're going to see how the LLM is storing what it is that you just said. So from there, a lot of folks will hear about these query key and value matrices.

This is what the actual model weights look like. So when you look at a model weights file, if you go on Hugging Face, there's typically a JSON file that will show you all of the different pieces of model files. And you'll see this thing called Q, K, and V.

So I have these model weights. So now I've gone from text to a matrix. I'm going to matrix multiply against the weights of the model. So now I get these three output matrices. So think of these weight matrices that I showed here. When you're doing a projection, what you're doing is you're taking some coordinates and you're putting them into a different space.

That's really what you're doing when you do vector matrix math. So now when I do this matrix multiplication, this query key and value matrix -- so if you look at different tutorials on attention, you'll see these things pop up a lot. So hopefully that will help you to read it a lot more.

This is now the LLM's interpretation of each of those tokens that you sent in. Right? And now the job is how do I now take these query key and value matrices and sort of interpret it to try to generate the next best token. And this is just happening constantly over and over every single token that's happening.
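Here's a toy, single-head sketch of that projection-and-attention step in PyTorch; the sizes are arbitrary, and Wq, Wk, Wv stand in for the Q, K, and V weights you'd see in a model file.

```python
# Toy single-head prefill: project the prompt, compute attention, keep K and V around.
import math
import torch

d = 64                                   # made-up hidden size
X = torch.randn(4, d)                    # 4 prompt tokens -> 4 embedding vectors
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv         # the model's interpretation of each token
scores = (Q @ K.T) / math.sqrt(d)        # how strongly each token relates to every other token
weights = torch.softmax(scores, dim=-1)
out = weights @ V                        # attention output for every prompt position
# K and V are what stays on the GPU for this request: the KV cache.
```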

But the key thing I want you to take away from this slide is where I drew the key and the value. Right? When people talk about KV cache optimization, every LLM performance engineer is literally just trying to make that thing as fast and small as possible. And that will make a little more sense as to what it does to your cost.

But ultimately, these key and value matrices, this is like your LLM's memory. So it will make a little more sense coming up. I know I didn't show a ton of the math. I show some tutorials afterwards so you can go read more about that. My intention here is for you to visualize key and value.

So every time you see a prompt, I just want you to think, crap, key and value is on my GPU. Okay? Next. So here's the real value of the KV cache. So remember we said that whenever I generate a token, I'm going to push it back onto the GPU.

Right? So every token I generate, it goes back into the GPU. And then I have to compute an attention mechanism. So this is what's happening. This new token I generated, LLM, I get its vector representation, as you see in blue. But now I do that vector matrix math now.

So before, I did matrix-matrix math; that's when my prompt first comes in. I generated my first token. Now I'm doing vector-matrix math. You know, people will batch this across all requests, but I'm just showing you a single request so you can see it. Now, the value of the KV cache is that if I didn't have the KV cache, I would have to reprocess all of that work I did on the prompt before.

So this is the benefit of your KV cache. Now I'm just going to compute attention on this newest token. How does this new token relate to everything that I said before? That's the thing that's really happening intuitively. So if I have this KV cache, my generation is going to be fast.
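Continuing that toy sketch, one decode step with a KV cache looks roughly like this (again PyTorch, with invented sizes); without the cache, K and V for the whole prompt would have to be recomputed for every generated token.

```python
# Toy decode step: only the new token's query/key/value get computed,
# then attention is taken against the cached keys and values.
import math
import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache = torch.randn(4, d)              # keys for the 4 prompt tokens (from prefill)
V_cache = torch.randn(4, d)              # values for the 4 prompt tokens

x_new = torch.randn(1, d)                # embedding of the newly generated token ("LLM")
q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv

K_cache = torch.cat([K_cache, k])        # the cache grows by one row per generated token
V_cache = torch.cat([V_cache, v])

scores = (q @ K_cache.T) / math.sqrt(d)  # vector-matrix math: the new token vs. everything before it
out = torch.softmax(scores, dim=-1) @ V_cache
```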

And it's really up to what's called the batch manager on the GPU to make sure that I'm just pushing out as many tokens as possible. Okay. So if you look at an LLM, these groups of three matrices are called an attention head. There are more matrices than that, but these are the main ones.

Llama has 32 attention heads. So I just kind of want you to appreciate what an LLM really looks like. All right. So I have 32 sets of these matrices. I have 32 of those KV caches happening at the same time. And now I have to combine all of that to then generate the next token.

So there's an incredible amount of work that happens in a very short space of time to give you a coherent token. Okay. A good mental model for you to keep in your head -- I'm going to speed up a little bit -- is this: if you see the number of parameters, multiply that by two, and that is your FP16 memory in gigabytes on the GPU.

So if you have, let's say, an L4, which is 24 gigs, and I have a Llama 8B, that's automatically 16 gigs in FP16. So I only have about 8 gigs left for my KV cache. So on the GPU, it's either the model weights or tokens. That's it. There's nothing else on the GPU.
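As a back-of-the-envelope sketch of that rule of thumb; the figures are just the example from the talk, and the KV-cache budget is left as "whatever remains" since the per-token size depends on the model's layers, heads, and precision.

```python
# Rule of thumb: parameters x 2 bytes = FP16 weight memory. Whatever is left on
# the GPU is the budget for KV cache (tokens). Figures are illustrative.
params_billion = 8                       # e.g. a Llama-8B-class model
weights_gb = params_billion * 2          # FP16: 2 bytes per parameter -> ~16 GB

gpu_memory_gb = 24                       # e.g. an L4-class GPU
kv_budget_gb = gpu_memory_gb - weights_gb
print(f"~{weights_gb} GB of weights, ~{kv_budget_gb} GB left for KV cache")
```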

And I have a thing to read on that. This is a really good blog. It shows you all of the different optimizations that you can do. Okay. Now let's talk about measuring. So if you ever see this thing called ISL or -- do I have it there? Oh, sorry.

ISL or OSL, that's input sequence length, output sequence length. So now I want you to see what some advanced monitoring might look like. If any of you are DevOps folks, these are things that you want to record. The first thing that we measure is time to first token: how long does it take me to process the prompt and then generate my first token?

And that's typically a measurement of how good your attention mechanism processing is. That's really what you're trying to suss out. So that's time to first token. Then inter-token latencies. So after I've generated my first token, for every single token after that, I'm looking at those individual spacings.

So for everything that's going to happen there, think about when the system is under load. I have, you know, a thousand requests coming into my system, I'm generating a thousand sets of different tokens. And the more memory I occupy, typically that slows down processing. So if you start to see drift in this metric, then -- well, I'll show you some plots that you can look at.

And then time to total generation: how long did it take me from initially getting the prompt to fully finishing the answer? All right? Super intuitive. Like I said, ISL, OSL, that's all those mean when you see them on the plots coming up.
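Here's a hedged sketch of how you might record those three numbers around a streaming client; stream_tokens() is a hypothetical generator standing in for whatever streaming API your server actually exposes.

```python
# Record time to first token (TTFT), inter-token latencies (ITL), and total
# generation time for one request. `stream_tokens` is a placeholder for your
# own streaming client.
import time

def measure(stream_tokens, prompt):
    start = time.perf_counter()
    stamps = []
    for _ in stream_tokens(prompt):               # one timestamp per generated token
        stamps.append(time.perf_counter())

    ttft = stamps[0] - start                      # time to first token
    itl = [b - a for a, b in zip(stamps, stamps[1:])]  # spacing between consecutive tokens
    total = stamps[-1] - start                    # time to total generation
    return ttft, itl, total
```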

Okay. This is a very important paradigm for you to understand in your mind. So I've worked with a lot of folks on, you know, maybe RecSys deployments or deployments of other types of models. So on the GPU, if you're only deploying one model on a GPU, outside of LLM inference, in my opinion, I think you're wasting the GPU. You can put multiple models on the GPU to actually increase your throughput.

That's why it was really created. So this is a slide -- excuse me. This figure is just showing I can have multiple models. I have some space for data and that's how I increase my throughput per unit hardware. However, on the LLM inference side, it's very different. I have one model.

You know, folks can fit multiple models on a GPU. That's cool, but that's not a real production use case. You'll typically have a single model. The remaining space that you have is all for the KV cache and generating all those tokens. So I just put four different requests and I just kind of want you to see the boxes that are happening.

Okay. I would say this is the most important slide in the entire presentation because this is the thing that will determine both your cost and performance. So there are four different querying patterns that happen. And this is something that you must measure in your deployment because oftentimes you might read benchmarks and just say, all right, they'll cherry pick one or two of these.

But in reality, in your production system, you might have several of these different patterns that are occurring. So let's take a look at the first one. Long input, short output. So long input means it's going to take me technically longer to compute the attention mechanism. So my pre-fill stage will be longer.

It occupies more memory from my prompt. Does that make sense intuitively? Hopefully it's grabbing you. But then on the generation side, I don't generate many tokens, so those tokens are not taking up a lot of memory. And they will tend to finish fast. So the second one, or maybe the most costly use case is, so I have clients that will message me and say, hey, my data scientists are putting too-big prompts on my GPUs.

So now they're killing my deployment. Because if everyone went and put the maximum context length, I can only fit so many requests on the GPU. So that's something for you to think about. You'll have to manage that internally with your deployments. So that's why I'm putting, you know, okay, the GPU is really full.

Because long text -- excuse me, long input, long output. The next one, short input, long output: you know, your time to first token will be really fast. I don't have much to compute the attention mechanism on. But hey, I'm generating a ton of tokens. That's really, really fast. So hopefully, as you start measuring these types of different query patterns, you'll see different results.
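As a small illustrative helper for bucketing logged requests into those four patterns; the thresholds below are arbitrary assumptions you'd tune to your own traffic.

```python
# Classify a request into one of the four querying patterns by its token counts.
LONG_IN, LONG_OUT = 2048, 512            # assumed thresholds, in tokens

def pattern(input_tokens: int, output_tokens: int) -> str:
    i = "long-input" if input_tokens >= LONG_IN else "short-input"
    o = "long-output" if output_tokens >= LONG_OUT else "short-output"
    return f"{i}/{o}"

print(pattern(4000, 100))    # long-input/short-output: long prefill, cheap generation
print(pattern(200, 1500))    # short-input/long-output: fast TTFT, memory grows while decoding
```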

I just put, you know, what a random sampling set might actually look like on the GPU. Because not everyone will send the same length of input and output. So that will -- it'll be good for you to just sort of visualize and track these statistics. More importantly, why we're doing that internally -- I'm going to steal the time here, Peter.

More importantly, why we're doing that or why we're tracking these things is that the whole goal is to build -- I have a big model. My goal is to shrink it as much as I can, but to keep it as accurate as possible. So the more that I shrink, the faster it runs, the more GPU memory I have for what?

Tokens. All right? So that's how you really try to improve your cost. This is why I'm sort of proposing to you to build inference engines. So all I'm showing here is a 2D histogram of input sequence length versus output sequence length. Because the question that you'll have to answer is, hey, how long are my actual prompts?

Someone might say, okay, here's the max prompt length that you can ingest, and the max prompt you can out -- excuse me, get on the output. And all of the big box model providers have to estimate this when they go into costing or providing a service to you, right?

Because they have to host all of that machinery under the hood now that you understand what's happening. So we use this to statistically determine what is the max input sequence length and the max output sequence length across all of my users. And this will give you a really good indication of how you can size your engines.
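For instance, here's a rough sketch of that sizing step, assuming NumPy and that you've already logged per-request token counts; the arrays below are invented.

```python
# Summarize logged traffic into the ISL/OSL heat map and a sizing estimate.
import numpy as np

isl = np.array([820, 1300, 950, 4100, 760, 1250])   # input sequence lengths (tokens), made up
osl = np.array([120, 300, 90, 650, 150, 210])       # output sequence lengths (tokens), made up

hist, isl_edges, osl_edges = np.histogram2d(isl, osl, bins=8)   # the 2D histogram / heat map
p99_in, p99_out = np.percentile(isl, 99), np.percentile(osl, 99)
print(f"build the engine for roughly {p99_in:.0f} input / {p99_out:.0f} output tokens")
```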

We use that to actually build more optimized engines. In addition, it will just give you a good view as to, what do you call it, scaling out and things like that. The next one is time-to-first-token analysis. Remember, time to first token is measuring the performance of my attention mechanism under load.

So someone might show the attention mechanism at one query. Show me the attention mechanism under load. When this thing is fully maxed out 24/7, that's when you really need to start measuring these types of things. So this is something you can look at. These are sort of experimental plots. There's a package called GenAI-Perf that has been released open source.

It's out already. I have a link to it there. This is where it will generate these plots for you. But I'm just showing you what the engineers are looking at internally to measure the performance of the compute platform. Next, time to completion analysis. How long did it take me to go from start to finish across every single request.

Naturally, the wider that box plot, you have to intuitively ask what's happening. Why did this person's prompt take longer than another? So you can investigate either batching issues, scheduling issues, different things like that. I'll take questions in the end. Oh, I have to move really fast. Sorry there, Peter.

Okay. I'm going to speed up here. Token to token latency. Peter, how much time I got? Oh, you're fine. We'll definitely have time for the question. Okay, cool. I'm going to steal. I'm definitely over. Sorry. I realize I may have gone a little too fast. So forgive me for that.

No, you have five minutes. Cool. All right. Token-to-token latency. So that is, I'm generating tokens, and I'm looking at that spacing versus token position. So the longer a sequence gets, remember, my memory grows. So typically that means the system is under more load, and more throttling might happen under a high load of requests.

So if I see a large variation in token-to-token latency as the sequence gets longer while I'm generating, that means I'm not very performant. All right. So we look at that, and I try to make sure that that's constant, no matter how many tokens I'm generating. That means I'm really performant.

Okay. The last one would be time to first token versus number of input tokens. So time to first token, remember, is computing the attention mechanism, okay, versus the number of input tokens. So if I have a bigger prompt, my attention will take longer. But if that plot goes up, like, from your perspective, it goes up steeply in terms of sequence length, that's not really good performance.

We really look at that slope and we try to get that slope almost, you know, as low as possible. So if you send me this long sequence, I can get that thing done really fast. Okay. Okay. In terms of software, you will see this thing called TRT LLM. Triton is an open source inference server.

So you can deploy models on CPU, on GPU, computer vision, RecSys, Python, PyTorch, TensorFlow. It will host all of the different types of models. So there's one way that your deployment team deploys. All their data scientists are happy because they don't have to do conversion. You're happy as a deployment person because you don't have to manage TorchServe versus TFServe and Flask; all of it is done through one.

It's written in C++, blazingly fast. And then the other thing you'll see from NVIDIA, you'll see a lot more coming out of NVIDIA's inference microservice, because building these engines, getting them deployed, optimized at scale, it's not easy. So we've sort of made that easy for you as an enterprise offering, but you guys can try it out for free.

Okay. So TRT LLM, let me just give you lots of stuff on this slide. But the main thing I want you to walk away with is this is the model compilation package for LLMs on NVIDIA GPUs. If you want to get best performance from NVIDIA GPUs, please make sure you use TRT LLM.

Naturally, as we're investing more in NIM, you'll see some more things come out. So you'll see performance numbers on A100 and H100. Really focus on the FP8 GPUs. So FP8 will be Hopper and Ada Lovelace. Okay. So FP8, I'll talk a bit more about what the advantage there is. But mainly, if I go from FP16 to FP8, it's this.

Half my memory. Almost the same accuracy. And so we measure the accuracy and we publish the accuracy. So now I have this much more space for tokens. But more importantly, this model is that much faster. Okay. So I want you to understand where the sort of industry is going.

This is why Hopper, the world ate Hopper for breakfast and lunch and dinner because of FP8. It gave folks that cost benefit to do this thing a lot faster. Okay. In-flight batching, it just means I don't have to wait for all the requests to finish to start a new request.

The moment your request finishes, I can inject a new request while others are going. Okay. Tons of features here. I put the features. So some ones to focus on are quantized KV cache. So I can actually represent my KV cache in different precision. So that means I'm actively shrinking that memory, making it more performant.

You have paged KV cache. That's just you managing all of that GPU memory a lot better. So there are tons of things you can do. Tensor parallelism. The thing to remember about tensor parallelism: if you want to improve latency, use tensor parallelism. Split the model up across multiple GPUs.

That's typically done within a node. I repeat that: that's typically done within a node. You don't want to do tensor parallelism across nodes. You'll see pipeline parallelism go across nodes. Pipeline parallelism is more sequential, so I process this chunk, and in a multi-node model, like huge models, this box will finish and pass off to the next box.
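Here's a toy illustration of the tensor-parallel idea in PyTorch: one layer's weight matrix split column-wise across two hypothetical devices, with the partial results concatenated. Real frameworks do this across GPUs with NVLink/NCCL collectives; plain CPU tensors stand in here.

```python
# Toy tensor parallelism: split a weight matrix by columns, compute each shard
# "on its own device", and stitch the outputs back together.
import torch

d_in, d_out = 1024, 4096
x = torch.randn(1, d_in)
W = torch.randn(d_in, d_out)

W0, W1 = W.chunk(2, dim=1)               # each shard would live on a different GPU
y = torch.cat([x @ W0, x @ W1], dim=1)   # concatenated shard outputs == the full result

assert torch.allclose(y, x @ W, atol=1e-4)
```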

But most folks will typically just work -- most models will work within a single node. So those are some of the things. In terms of models that you have access to, we optimize those models and we give you a lot of scripts where you can go do that on your own, or you can sort of take our software and take an easy path.

Either way, we support you. So here are some of the models that are there. All of the Llamas, Mistral, Mixtral, we work with all those teams behind the scenes. So typically before any foundation model comes out, we work with those teams to get them deployed. Okay, what does it mean for TensorRT?

So you might have seen TensorRT before, which was a deep learning compilation package for NVIDIA GPUs. Lots of folks in computer vision, et cetera, et cetera, have used that. We took the best practices from there and added all of the extra things that need to happen in the LLM inference loop.

So that's what TRT LLM is really about. So mainly focus on LLM inference. Here's a good visual. An engine that's built to a specific GPU cannot be moved to another GPU. So you always have to compile to that GPU. That's why it's that performant, because we really leverage all of the actual hardware on that system to rewrite the algorithm, rewrite that model to that specific piece of hardware.

Okay. TRT LLM and Triton. So TRT LLM will give me an inference engine. I need something to host that inference engine and accept requests, do batching, et cetera. So we have Triton. Triton works very simply. It's literally a folder where you specify the model, and it works on tensor in and tensor out. So it will tell you what my inputs and outputs are. And then it will basically understand how to interpret that file. Or you can host any other different models. That's a thing I do a lot with folks.
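As a hedged sketch of that tensor-in / tensor-out idea from the client side: this assumes the tritonclient package and a hypothetical model named "my_model" that takes an INT32 "input_ids" tensor and returns "output_ids"; your actual model and tensor names come from that folder's config.

```python
# Minimal Triton HTTP client call: tensors in, tensors out. Model name and
# tensor names are hypothetical and must match your model repository's config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[101, 2203, 3099, 777]], dtype=np.int32)   # made-up token IDs
inp = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
inp.set_data_from_numpy(input_ids)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("output_ids"))
```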

So it will tell you what are my inputs coming in and out. And then it will basically understand how to interpret that file. Or you can host any other different models. That's a thing I do a lot with folks. Just two more slides. This is where the future of inference is going.

So a lot of folks do FP16 inference today. A lot of folks are moving towards FP8 just because, hey, I now half the model size, almost twice the speed, more space for tokens. It just makes more sense from a cost perspective. That's why folks like that. And then you saw Blackwell was announced.

That's the major innovation. I get FP4. So that's where things are really going to get interesting. I'll end with NVIDIA inference microservice. So we've made this thing really easy. We've gone and actually found the best configurations for all of these models on each piece of GPU. And we're slowly rolling out all of the models because it, you know, will just take some time to optimize the world, essentially.

And yeah, you can use this to download all the slides. I put papers. Tons of other things for you to read. So, yeah. Hopefully, your intuition has sharpened. Shall we just conclude with the, because if someone had a question. Sure. Yeah, I think. Where was the question? Yeah. Oh, hang on.

I'm going to come over and point my mic at you. Thank you. So, hi. Sorry. My question is actually on the heat map that you shared. Yeah, yeah. Do you mind walking through the heat map and how to interpret it? Because it was a little small. Yeah, sorry about that.

Yeah. So the heat map, Thanks. All I'm looking at is, um, so when you go to build an engine, you build an engine to the max input sequence length and the max output sequence length. So we actually change how that matrix math is happening under the hood based on those settings.

So you might say, all right, my users are only going to send 4,000 tokens. But in reality, they might have been sending 1,300 over the past week that you measured. So now you can say with statistical certainty that, hey, for the majority of people we were serving during this time, these were the querying patterns.

So I can rebuild an engine for that period of time. What gets super interesting, this is a topic I'm very interested in, is seasonal engines. So during the day, you have different querying patterns. So you'll scale down, you'll scale up. And so you might have different engines built for different types of querying patterns based on traffic and stuff like that.

So hopefully that may have answered the question, yeah. But it's just saying, you know, looking at the bounds: what's the minimum and maximum number of tokens that came in, the min and max that went out, and just looking at that over the entire distribution. Yes, sir? Oh, well, yeah, right there.

When it comes to those, uh, inference strategies you talked about, like LILO and LISO, um, what kind of strategies do you have to manage which ones are used? Because obviously each session is going to be pretty generic. You don't know which one to use at first.

Correct. Um, do you split those between GPUs, or do you stick with one, and does it switch between...? So typically you try to find one configuration that will manage the plethora of types of requests that you have coming in. So we're typically at one engine for all the different querying types.

And I think you'll start seeing, I'm giving you a little bit of future ways to think about it on the DevOps side because that's something you'll have to test, right? If I look at this querying pattern that came into my system with this engine, if I switch the engine, does it still satisfy the querying pattern?

And how much cost does it save? How much faster is it? So that's more of a, an engineering exercise that you'll have to deploy. Sorry, I, I didn't have a... Yeah, yeah, so I, I just, I'm very interested in the seasonal side just because, okay, querying patterns will change.

Um, especially when agents come, it'll just be, that's going to get super interesting when agents are just throwing stuff. Yes, sir? Um, so question about how you measure quality of attention. Um, is it, is it correct intuition to think that attention is a fundamentally scarce resource in the sense of, it's about paying attention to one thing at the expense of other contexts?

So then how can you, like, scale attention mechanisms infinitely the way we can conduct? Yeah, so, so what people do in order to scale the attention mechanism is, here's another interesting fact that, um, why folks don't train huge context models, because it's actually, now you've seen, the bigger my, uh, prompt, the more memory I need.

So imagine what that does to a huge, I don't know, 10,000 or 100,000 GPU deployment. It might take a million GPUs just to do that context length. So people will train at a small context length and then interpolate to give you that longer context length, and then you're sort of bound to whatever attention mechanism you were using.

Then there are things like FlashAttention that will just do everything in the L1 cache really, really fast. So it depends on the speed of some of the different mechanisms, and it also depends on the GPU as well. So that's why, um, if you look at Blackwell that was announced by Jensen, they literally have connected, I think, 72 different GPUs on one NVLink.

So NVLink connects GPUs together, that's how we can move data insanely fast, and now we've connected like 72 GPUs on one. That's, that's just to show you, um, like mixture of experts trying to compute attention across all of these different things. But that's actually a really good question. No, I don't necessarily think so.

Like the entire industry is, you know, going after that problem. That's why everybody wants to maybe see something other than attention and, ah, you know, there's so much excitement there. Yeah, unfortunately they're going to call time on us now, but that's been fantastic. Thank you, Mo.