Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

It's very difficult to teach extremely technical material in a short amount of time. Initially, I had planned for at least a 45-minute session, so I left some reading material for you at the end. All of the resources are downloadable, slides and everything, so feel free to take screenshots or not. I work at NVIDIA as a solutions architect, and it's my job to work with clients and understand what their main challenges are.

My hope today is that you get a better intuition of exactly what's happening with this particular workload, how you go about sizing things and choosing different GPUs, and, more importantly, controlling the cost of a deployment. Most folks that I've seen are doing some kind of hybrid, using hosted APIs alongside some fine-tuned model that you have internally.
Just for reference, if you go to build.nvidia.com or ai.nvidia.com, everyone can get 1,000 inference requests for free. I typically recommend this to folks who are benchmarking, teaching a course, or trying to evaluate all of the different LLMs out there for your business; there are also multi-modal LLMs and speech LLMs. Every model that NVIDIA accelerates will be available there for you. That's a path for you to either go optimize the models yourself or take the packaged route: you'll see things about NVIDIA Inference Microservice and all of those things that you can take to enterprise. So we have what I'll call the rocky road of doing it yourself and the easier enterprise path. Whatever path you want to take, we're here to support you.
First, I want you to understand the LLM inference workload. Then we'll move to how you go about measuring a production deployment and some of the things you need to be watching; it's a little more than just the total time to generation, and it means really understanding what's happening on the GPUs as you scale out. I think it's very important for you to have that intuition. And then lastly, I'll show you some software that you can use.

Okay, we're going to get into the LLM inference workload itself.
The first part is really understanding what happens when you send in a prompt. Say I ask, okay, write me a presentation so I sound smart (and then maybe you guys are going to like the talk). Essentially what I'm going to do is put that prompt on the GPU. The moment I send that prompt to the GPU, it stays on the GPU, and from there I'm going to generate one token at a time. So as I generate the tokens of the answer, "LLM inference is hard," keep in mind that no matter how fast anyone claims they're doing things, it's typically one token that's generated at a time.

The next thing is that in order for an LLM to give you a coherent answer, it has to remember every single thing that was said before, and you'll see the mechanism of how LLMs are able to do that. That's why I'm showing the generated text "LLM inference is hard" in red and putting it back onto the GPU: every token that I generate gets locked onto the GPU, and you'll see shortly what that looks like in terms of vectors.

How many of you have heard of KV cache before? Typically I don't see many leaders hearing about this thing called KV cache, but KV cache is the thing that, to some degree, really drives the cost. Whether you use some big-box API or you're running on a single GPU, it's all the same mechanism, the same algorithm that everyone is trying to optimize.
In terms of steps, here I like to, as I said, sharpen your intuition. Moving from the left, my first job is to convert the text, whatever text you send (we're focusing on LLM inference), into tokens. The model has its own vocabulary, and it's my job to translate into it; I'll give you the technical terms right after this. The first thing that happens is some initial prompt processing: I have to compute the attention mechanism over the entire prompt, per user. So if I have a million people hitting my service and each of them sends 10,000 tokens, that's a million prompts of 10,000 tokens each on which I have to compute attention, all while generating tokens for other people. It's good for you to appreciate the complexity of what's happening. Once the prompt is processed, I start generating one token at a time, and then every token that gets generated, which lives in the LLM's vocabulary, I need to de-tokenize back into your language.
Here are the technical terms you'll see when you read the literature: tokenization and de-tokenization. The thing to think about with tokenizers is that when they did pre-training, they downloaded the internet and then some, right? And they cleaned it up, et cetera. As you start thinking about the complexity across languages, coding languages, regions, and so on, they tried to find the minimal set of character groups that can represent that entire training data set efficiently. For instance, the Llama tokenizer has a vocabulary of 128,000 tokens.
Here's what it actually looks like on a GPU. Prefill is the stage where you compute the attention mechanism over the prompt. Many people are making advances in attention mechanisms, and people leverage all of the different memory hierarchies in GPUs to really accelerate this type of workload. Then I start generating tokens one at a time. The red and the green just signify that I'm storing those tokens on the GPU. Hopefully that makes intuitive sense.

The other thing I want you to visualize is the actual data that sits on the GPU.
First, a token is approximately four characters of text. The first vector is just showing token one through token V, where V is the total number of tokens in my tokenizer's vocabulary. The second vector below shows that each token has a numeric index, because I don't want to keep using the token text to reference itself. So my job when a prompt comes in is to convert that text into what I'm going to call token IDs. Here I have "make me sound smart" and "make me sound smarter." The key thing I want you to walk away with from that distinction is that an LLM token is not a human word; you'll see weird symbols when you look at tokenizers from different models.

From there, each one of those LLM tokens has a corresponding embedding vector. Think of it as a representation that an LLM can use to compare things and do math on. That's why we always want to convert into some vector representation, because a vector is just a point in some high-dimensional coordinate space. So from those token IDs, I go to the actual embedding vectors themselves. If you look, "make me sound smart" now becomes a matrix, and "make me sound smarter" becomes a matrix with an extra column. In reality, this is what you're doing every time you submit a prompt, no matter what LLM you submit it to or who you submit it to. You are converting your text, and now images as well, into some IDs or tokens (diffusion models would be another interesting talk), and you're really putting this large matrix on the GPU.
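Here's a minimal sketch of that text-to-token-IDs path using the Hugging Face tokenizer API. The model name is an assumption for illustration (it is gated on Hugging Face); any causal-LM tokenizer shows the same behavior.

```python
from transformers import AutoTokenizer

# Model name is illustrative; any causal-LM tokenizer demonstrates the same idea.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(len(tok))  # vocabulary size V (about 128k for Llama 3)

for prompt in ["make me sound smart", "make me sound smarter"]:
    ids = tok(prompt).input_ids
    print(prompt, "->", ids, "->", tok.convert_ids_to_tokens(ids))
    # On the GPU, each ID is looked up in an embedding table, giving a
    # (num_tokens x hidden_dim) matrix per request: that matrix is what prefill works on.
```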
The next question you should ask is: why are GPUs good for this workload? Because they process matrices really, really fast. That's the advantage, and hopefully this now makes a lot more sense to you.

The next thing I want to talk about is how the LLM processes these tokens: the attention mechanism. I'm not even going to ask you to raise your hand if you've read about it, and if you have not, I'm not sure what's happening; I truly think it's one of the things that you should understand. The fundamentals of that mechanism, and seeing the innovations around it, can help anyone, any business leader, et cetera, just because you're able to speak a different kind of language in this generative future.
As you think of the attention mechanism, the intuition you should have is simply: how do I distinguish what is important in a sentence? And then, for the next token that's going to be generated: which of the tokens that came before were really important for making a good decision about that next token? We won't necessarily touch too much of the math, but I want you to see what's happening on the GPU. I'm just going to use a short prompt, "make me sound smart," and I'm going to generate this token called "LLM." We see the same matrices as before; remember, my text turns into a matrix landing on the GPU. The main thing I want you to understand or visualize here is how the model stores context. You've recorded everything that I've said for the last ten minutes somewhere in your brain; now you're going to see how the LLM stores what you just said.
From there, a lot of folks will have heard about the query, key, and value matrices. This is what the actual model weights look like. When you look at a model weights file on Hugging Face, there's typically a JSON file that shows you all of the different pieces of the model, and you'll see things called Q, K, and V. I take the incoming embeddings and matrix-multiply them against these model weights. Think of each of these weight matrices as a projection: you're taking some coordinates and putting them into a different space. That's really what you're doing when you do vector-matrix math. When I do this matrix multiplication, I get query, key, and value matrices for the prompt; if you look at different tutorials on attention, you'll see these pop up a lot, so hopefully this helps you read them more easily. These are now the LLM's interpretation of each of the tokens that you sent in.

Now the job is to take these query, key, and value matrices and interpret them to try to generate the next best token, and this is happening constantly, over and over, for every single token. But the key thing I want you to walk away with from this slide is where I drew the key and the value. When people talk about KV cache optimization, every LLM performance engineer is literally trying to make that thing as fast and as small as possible, and that will make more sense when we get to cost. Ultimately, these key and value matrices are like your LLM's memory. I know I didn't show a ton of the math; I point to some tutorials afterwards so you can go read more. My intention here is for you to visualize the key and value, so that every time you see a prompt, you think: key and value are on my GPU.
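To make the Q/K/V projection concrete, here is a small, framework-free sketch of single-head scaled dot-product attention in NumPy. The shapes and names are illustrative and causal masking is omitted for brevity, so treat it as intuition rather than anyone's production kernel.

```python
import numpy as np

def single_head_attention(X, Wq, Wk, Wv):
    """X: (num_tokens, d_model) prompt embeddings.
    Wq/Wk/Wv: learned weight matrices (the Q, K, V you see in the model files).
    Returns the attended output plus K and V, which are what the KV cache stores."""
    Q = X @ Wq                      # queries: one row per prompt token
    K = X @ Wk                      # keys   } these two are the "KV" in KV cache
    V = X @ Wv                      # values }
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, K, V

# "make me sound smart" -> 4 illustrative token embeddings of dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, K_cache, V_cache = single_head_attention(X, Wq, Wk, Wv)
print(out.shape, K_cache.shape, V_cache.shape)  # (4, 8) (4, 8) (4, 8)
```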
Okay, next: here's the real value of the KV cache. Remember we said that whenever I generate a token, I push it back onto the GPU; every token I generate goes back onto the GPU, and then I have to compute an attention mechanism again. So this is what's happening: for this new token I generated, "LLM," I get its vector representation, as you see in blue, but now I do vector-matrix math. Before, when the prompt first came in and I generated my first token, I did matrix-matrix math; now I'm doing vector-matrix math. People will batch this across all requests, but I'm showing you a single request so you can see it. The value of the KV cache is that if I didn't have it, I would have to reprocess all of the work I did on the prompt before. That's the benefit of your KV cache: now I just compute attention on the newest token. How does this new token relate to everything that was said before? That's what's really happening, intuitively. If I have this KV cache, my generation is going to be fast, and it's really up to what's called the batch manager on the GPU to make sure I'm pushing out as many tokens as possible.
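Continuing that NumPy sketch, here is what a single decode step might look like once the prompt's keys and values are already cached. Again purely illustrative, but it shows why generation is vector-matrix math and why the cache (and memory) grows by one row per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill already happened: the prompt's keys/values are cached (4 tokens here).
X_prompt = rng.normal(size=(4, d))
K_cache, V_cache = X_prompt @ Wk, X_prompt @ Wv

def decode_step(x_new, K_cache, V_cache):
    """x_new: (d,) embedding of the single newest token.
    Reuses cached keys/values instead of reprocessing the whole prompt."""
    q = x_new @ Wq                               # vector-matrix math, not matrix-matrix
    K_cache = np.vstack([K_cache, x_new @ Wk])   # append this token's key...
    V_cache = np.vstack([V_cache, x_new @ Wv])   # ...and value: the cache (and memory) grows
    scores = q @ K_cache.T / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V_cache, K_cache, V_cache         # attention over everything said so far

x_llm = rng.normal(size=(d,))                    # embedding of the generated token "LLM"
out, K_cache, V_cache = decode_step(x_llm, K_cache, V_cache)
print(K_cache.shape)                             # (5, 8): 4 prompt tokens + 1 generated token
```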
If you look at an LLM, these groups of three matrices are called an attention head. There are more matrices than that, but these are the main ones. Llama has 32 attention heads per layer, so I just want you to appreciate what an LLM really looks like: I have 32 sets of these matrices and 32 of those KV caches being computed at the same time, and then I have to combine all of that to generate the next token. There's an incredible amount of work happening in a very short space of time to give you a coherent token.

A good mental model to keep in your head (I'm going to speed up a little bit) is: take the number of parameters, multiply it by two, and that is your FP16 memory in gigabytes on the GPU. So if you have, let's say, an L4 with 24 gigs, and I have a Llama 8B, that's automatically 16 gigs in FP16, so I only have about 8 gigs left for my KV cache. On the GPU, it's either model weights or tokens; that's it, there's nothing else on the GPU. And I have a thing for you to read on that, a really good blog that shows you all of the different optimizations you can do.
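As a back-of-the-envelope version of that mental model: the per-token KV-cache formula below is the standard 2 (one key plus one value) x layers x KV heads x head dim x bytes accounting, and the Llama-8B-style shape numbers are assumptions for illustration.

```python
def weight_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """FP16 is 2 bytes per parameter, FP8 is 1."""
    return num_params_billion * bytes_per_param  # billions of params x bytes ~= gigabytes

def kv_gb_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    # 2 = one key plus one value, per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value / 1e9

# Llama-3-8B-style shapes, assumed for illustration: 32 layers, 8 KV heads (GQA), head_dim 128.
weights = weight_gb(8)                      # ~16 GB in FP16
per_token = kv_gb_per_token(32, 8, 128)     # ~0.00013 GB per cached token
gpu_gb = 24                                 # e.g. an L4
budget = gpu_gb - weights
print(f"weights ~{weights:.0f} GB, KV budget ~{budget:.0f} GB, "
      f"~{budget / per_token:,.0f} cacheable tokens across all requests")
```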
Okay, now let's talk about measuring. If you ever see the terms ISL or OSL, that's input sequence length and output sequence length. Now I want you to see what some advanced monitoring might look like; if any of you are DevOps folks, these are things you want to record.

The first thing that we measure is time to first token: how long does it take me to process the prompt and then generate my first token? That's typically a measure of how good your attention mechanism processing is; that's really what you're trying to suss out. Next, inter-token latencies: after I've generated my first token, for every single token after that, I'm looking at the individual spacing between tokens. Think about when the system is under load: I have a thousand requests coming into my system and I'm generating a thousand different sets of tokens, and the more memory I occupy, the more that typically slows down processing, so you want to watch for drift in this metric (I'll show you some plots you can look at). And then time to total generation: how long did it take me from initially getting the prompt to fully finishing the answer? Super intuitive. Like I said, ISL and OSL, that's all those mean when you see them on the plots coming up.
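One way to capture those three numbers from the client side is to time a streaming response. This sketch assumes an OpenAI-compatible endpoint and the openai Python client, and it treats each streamed chunk as roughly one token; neither assumption comes from the talk.

```python
import time
from openai import OpenAI  # assumes an OpenAI-compatible serving endpoint

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
token_times = []
stream = client.chat.completions.create(
    model="my-model",  # hypothetical deployment name
    messages=[{"role": "user", "content": "make me sound smart"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())  # each chunk ~= one generated token

ttft = token_times[0] - start                                  # time to first token
itl = [b - a for a, b in zip(token_times, token_times[1:])]    # inter-token latencies
e2e = token_times[-1] - start                                  # time to total generation
print(f"TTFT={ttft:.3f}s  mean ITL={sum(itl)/len(itl):.4f}s  E2E={e2e:.3f}s")
```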
Okay, this is a very important paradigm for you to understand. I've worked with a lot of folks on, say, recommender system deployments or deployments of other types of models. Outside of LLM inference, if you're only deploying one model per GPU, in my opinion you're wasting the GPU; you can put multiple models on the GPU to actually increase your throughput. That's what it was really created for. This figure is just showing that I can have multiple models plus some space for data, and that's how I increase my throughput per unit of hardware. On the LLM inference side, however, it's very different: I have one model. Folks can fit multiple LLMs on a GPU, and that's cool, but that's not a real production use case. You'll typically have a single model, and the remaining space is all for KV cache, for generating all of those tokens. So I've drawn four different requests, and I just want you to see the boxes of memory they occupy.
I would say this is the most important slide in the entire presentation, because this is the thing that will determine both your cost and your performance. There are four different querying patterns, and this is something you must measure in your deployment, because benchmarks will often cherry-pick one or two of these, while in your production system you might have several of these patterns occurring at once.

Let's take the first one: long input, short output. A long input means it's going to take me longer to compute the attention mechanism, so my prefill stage will be longer and the prompt occupies more memory. Hopefully that makes sense intuitively. But on the generation side I don't generate many tokens, so those tokens aren't taking up a lot of memory, and the requests tend to finish fast. The second one, long input with long output, is maybe the most costly use case. I have clients that message me and say, hey, my data scientists are putting too-big prompts on my GPUs and now they're killing my deployment, because if everyone goes and uses the maximum context length, I can only fit so many requests on the GPU. That's something for you to think about; you'll have to manage it internally in your deployments. That's why I'm drawing the GPU as really full for long input, long output. The next one, short input with long output: your time to first token will be really fast, because I don't have much to compute the attention mechanism on, but I'm generating a ton of tokens (and the fourth pattern, short input with short output, is light on both sides). As you start measuring these different query patterns, you'll see different results. I've also drawn what a random sampling of requests might actually look like on the GPU, because not everyone sends the same length of input and output, so it's good for you to visualize and track these statistics. (I'm going to steal a bit of time here, Peter.)
More importantly, the reason we track these things is that the whole goal is this: I have a big model, and my goal is to shrink it as much as I can while keeping it as accurate as possible. The more I shrink it, the faster it runs, and the more GPU memory I have for what? Tokens. That's how you really improve your cost, and it's why I'm proposing that you build inference engines. What I'm showing here is a 2D histogram of input sequence length versus output sequence length, because the question you'll have to answer is: how long are my actual prompts? Someone might say, okay, here's the max prompt length you can ingest and the max length you can get on the output, and all of the big-box model providers have to estimate this when they go into costing or providing a service to you, because they have to host all of that machinery under the hood, as you now understand. We use this to statistically determine the max input sequence length and the max output sequence length across all of my users, which gives you a really good indication of how to size your engines; we use that to actually build more optimized engines. In addition, it gives you a good view into things like scaling out.
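As a quick illustration of that histogram, here is roughly how you might build it from your own request logs; the log-normal traffic below is made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-request token counts pulled from your serving logs.
rng = np.random.default_rng(1)
isl = rng.lognormal(mean=6.5, sigma=0.6, size=5000).astype(int)   # input sequence lengths
osl = rng.lognormal(mean=5.0, sigma=0.8, size=5000).astype(int)   # output sequence lengths

plt.hist2d(isl, osl, bins=50)
plt.xlabel("input sequence length (tokens)")
plt.ylabel("output sequence length (tokens)")
plt.colorbar(label="requests")
plt.title("ISL vs OSL over one week of traffic")
plt.show()

# Percentiles are what you would actually size an engine against, rather than the theoretical max.
print(np.percentile(isl, 99), np.percentile(osl, 99))
```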
The next one is time-to-first-token analysis. Remember, time to first token measures the performance of my attention mechanism under load. Someone might show you the attention mechanism for a single query; show me the attention mechanism under load. When the system is fully maxed out 24/7, that's when you really need to start measuring these things. These are experimental plots; there's a package called GenAI-Perf, already released open source, and I have a link to it there. It will generate these plots for you, but I'm just showing you what the engineers look at internally to measure the performance of the compute platform.

Next, time-to-completion analysis: how long did it take to go from start to finish across every single request? Naturally, the wider that box plot, the more you have to ask what's happening. Why did this person's prompt take longer than another's? You can then investigate batching issues, scheduling issues, things like that. (I'll take questions at the end. I have to move really fast, sorry there, Peter. How much time have I got? Five minutes? Cool.)

All right, token-to-token latency. As I'm generating tokens, I'm looking at the spacing between them versus token position. The longer a sequence gets, remember, the more my memory grows, which typically means the system is under more load, and more throttling may happen under a high load of requests. So if I see a large variation in token-to-token latency as the sequence gets longer during generation, that means I'm not very performant. We look at that and try to make sure it stays constant no matter how many tokens I'm generating; that means the system is really efficient.

The last one is time to first token versus number of input tokens. Time to first token, remember, is computing the attention mechanism. If I have a bigger prompt, my attention will take longer, but if that plot climbs steeply with sequence length, that's not good performance. We really look at that slope and try to get it as low as possible, so that if you send me a long sequence, I can still get it done really fast.
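One simple way to quantify that slope from your own measurements is a plain least-squares fit; the numbers below are invented for illustration.

```python
import numpy as np

# Measured time-to-first-token per request versus its input length in tokens (illustrative data).
isl = np.array([128, 512, 1024, 2048, 4096, 8192])
ttft_s = np.array([0.05, 0.09, 0.15, 0.27, 0.52, 1.05])

slope, intercept = np.polyfit(isl, ttft_s, 1)
print(f"~{slope * 1000:.3f} ms of extra TTFT per additional input token")
# Track this slope over time and across releases: a flatter line means prefill
# scales better with prompt size.
```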
In terms of software, you will see this thing called TRT-LLM. Triton is an open-source inference server: you can deploy models on CPU or GPU, computer vision, RecSys, Python, PyTorch, TensorFlow; it will host all the different types of models. So there's one way your deployment team deploys, all the data scientists are happy because they don't have to do conversions, and you're happy as a deployment person because you don't have to manage TorchServe versus TF Serving versus Flask; it's all done through one server, written in C++ and blazingly fast. The other thing you'll see from NVIDIA is a lot more coming out around NVIDIA Inference Microservice (NIM), because building these engines and getting them deployed and optimized at scale is not easy, so we've made that easy for you as an enterprise offering, but you can try it out for free.

Okay, TRT-LLM. There's a lot of stuff on this slide, but the main thing I want you to walk away with is that this is the model compilation package for LLMs on NVIDIA GPUs. If you want the best performance from NVIDIA GPUs, please make sure you use TRT-LLM. Naturally, as we invest more in NIM, you'll see more things come out there. You'll see performance numbers on A100 and H100. Really focus on the FP8-capable GPUs; FP8 means Hopper and Ada Lovelace. I'll talk a bit more about the advantage there, but mainly, if I go from FP16 to FP8, I halve my memory with almost the same accuracy, and we measure and publish the accuracy. Now I have that much more space for tokens, and, more importantly, the model is that much faster. I want you to understand where the industry is going: this is why the world ate Hopper for breakfast, lunch, and dinner, because of FP8. It gave folks the cost benefit of doing this a lot faster.

In-flight batching just means I don't have to wait for all the requests in a batch to finish before starting a new request: the moment your request finishes, I can inject a new request while others are still going.
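Here's a toy scheduler loop to illustrate that idea. It is not TRT-LLM's actual batch manager, just a sketch of the scheduling policy: finished requests free their slot immediately and queued work is injected without draining the batch.

```python
from collections import deque

def inflight_batching(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate). Each loop iteration
    simulates one decode step across all active requests."""
    queue, active, steps = deque(requests), {}, 0
    while queue or active:
        while queue and len(active) < max_batch:   # inject new work mid-flight
            rid, n = queue.popleft()
            active[rid] = n
        for rid in list(active):                   # one token per active request per step
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                    # slot frees up immediately
        steps += 1
    return steps

print(inflight_batching([("a", 3), ("b", 10), ("c", 2), ("d", 5)], max_batch=2))
```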
There are tons of features here; I've put them on the slide. Some to focus on: quantized KV cache, where I can represent my KV cache in a lower precision, which means I'm actively shrinking that memory and making it more performant. You have paged KV cache, which is just managing all of that GPU memory a lot better. So there are tons of things you can do.

Tensor parallelism: the thing to remember is that if you want to reduce latency, use tensor parallelism and split the model up across multiple GPUs. That's typically done within a node; I'll repeat that, it's typically done within a node. You don't want to do tensor parallelism across nodes. You'll see pipeline parallelism go across nodes. Pipeline parallelism is more sequential: I process this chunk, then, for multi-node (really huge) models, this box finishes and passes off to the next box. But most models will work within a single node.
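To picture what "splitting the model across GPUs" means, here is a minimal column-parallel matrix multiply in NumPy, with two arrays standing in for two GPUs; a real implementation shards the weights once and gathers the partial results over NVLink.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4096))          # one token's activations
W = rng.normal(size=(4096, 4096))       # a single weight matrix of the model

# Tensor parallelism: each "GPU" holds half the columns of W and computes its shard.
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                             # computed on GPU 0
y1 = x @ W1                             # computed on GPU 1
y = np.concatenate([y0, y1], axis=1)    # gather the shards (NVLink within a node)

assert np.allclose(y, x @ W)            # same result, half the per-GPU weight memory and compute
```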
Those are some of the things. In terms of the models you have access to, we optimize those models and we give you a lot of scripts so you can go do that on your own, or you can take our software and take the easy path. Either way, we support you. Here are some of the models that are there: all of the Llamas, Mistral, Mixtral. We work with all of those teams behind the scenes, so typically before any foundation model comes out, we've worked with the team to get it deployed.

Okay, what does this mean for TensorRT? You might have seen TensorRT before, which is a deep learning compilation package for NVIDIA GPUs; lots of folks in computer vision and elsewhere have used it. We took the best practices from there and added all of the extra things that need to happen in the LLM inference loop. That's what TRT-LLM is really about: it's focused on LLM inference. Here's a good visual: an engine that's built for a specific GPU cannot be moved to another GPU, so you always have to compile for that GPU. That's why it's so performant, because we really leverage the actual hardware on that system to rewrite the algorithm, to rewrite that model for that specific piece of hardware.

TRT-LLM and Triton: TRT-LLM gives me an inference engine, and I need something to host that engine and accept requests, do batching, et cetera. So we have Triton. Triton works very simply: it's literally a folder where you specify the model, and it works on tensor in, tensor out. It will tell you what inputs are coming in and out, and then it basically understands how to interpret that model file. You can also host any other kinds of models; that's something I do a lot with folks.

Just two more slides. This is where the future of inference is going. A lot of folks do FP16 inference today, and a lot of folks are moving towards FP8, just because I now have half the model size, almost twice the speed, and more space for tokens; it just makes more sense from a cost perspective. And then you saw Blackwell announced; that's the major innovation, I get FP4. That's where things are really going to get interesting.

I'll end with NVIDIA Inference Microservice. We've made this thing really easy: we've gone and actually found the best configurations for all of these models on each GPU, and we're slowly rolling out all of the models, because it will just take some time to optimize the world, essentially. And you can use this to download all the slides; I've put papers and tons of other things there for you to read. So, yeah, hopefully that's useful to you.
- Shall we conclude with questions, since someone had one? Oh, hang on, I'm going to come over and point my mic at you.
- Hi, sorry. My question is actually on the heat map that you shared. Do you mind walking through the heat map and how to interpret it?
- Yeah, sorry about that. So, the heat map: when you go to build an engine, you build it for a max input sequence length and a max output sequence length, and we actually change how the matrix math happens under the hood based on those settings. You might say, all right, my users are only ever going to send 4,000 tokens, but in reality they might have been sending 1,300 over the past week that you measured. So now you can say with statistical certainty that, hey, for the majority of the people we were serving during this time, these were the querying patterns, and I can rebuild an engine for that period of time. What gets super interesting, and this is a topic I'm very interested in, is seasonal engines: during the day you have different querying patterns, so you'll scale down and scale up, and you might have different engines built for different querying patterns based on traffic and things like that. Hopefully that answers the question; it's really just looking at the bounds, the minimum and maximum number of tokens that came in and the minimum and maximum that went out, over the entire distribution. Yes, sir?
- Right there. When it comes to those querying patterns you talked about, like long-in/long-out and long-in/short-out, what kind of strategies do you have to manage which ones are used? Because obviously each session is going to be pretty generic; you don't know which one to use at first. Do you split those between GPUs, or do you stick with one, and does it switch between them?
- So typically you try to find one configuration that will manage the plethora of request types you have coming in, so we're typically at one engine for all the different querying types. And I'm giving you a little bit of a future way to think about it on the DevOps side, because that's something you'll have to test, right? If I look at the querying pattern that came into my system with this engine, and I switch the engine, does it still satisfy that querying pattern? How much cost does it save? How much faster is it? That's more of an engineering exercise that you'll have to carry out. Sorry, I don't have a complete answer there. I'm very interested in the seasonal side just because querying patterns will change, especially when agents come; it's going to get super interesting when agents are just throwing requests at these systems. Yes, sir?
- A question about how you measure the quality of attention. Is it correct intuition to think that attention is a fundamentally scarce resource, in the sense that it's about paying attention to one thing at the expense of other context? If so, how can you scale attention mechanisms indefinitely?
- So, what people do in order to scale the attention mechanism, and here's another interesting fact about why folks don't train huge-context models: as you've now seen, the bigger my prompt, the more memory I need. Imagine what that does to a 10,000 or 100,000 GPU deployment; it might take a million GPUs just to serve that context length. So people train at a smaller context length and then interpolate to give you the longer context length, and then you're somewhat bound by the attention mechanism you were using. There are also things like FlashAttention that do everything in the fast on-chip cache, so it depends on the speed of the attention implementation, and it depends on the GPU as well. That's why, if you look at the Blackwell systems Jensen announced, they've literally connected, I think, 72 different GPUs on one NVLink domain. NVLink connects GPUs together, which is how we move data insanely fast, and now something like 72 GPUs are connected as one; that's partly for things like mixture-of-experts models computing attention across all of those devices. But that's actually a really good question.
- No, I don't necessarily think so. The entire industry is going after that problem; that's why everybody wants to maybe see something other than attention, and there's so much excitement there.
- Yeah, unfortunately we do have to call time now, but that's been fantastic. Thank you, Mark.