
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou



00:00:02.000 | - It's very difficult to teach extremely technical material
00:00:15.680 | in about 20 minutes.
00:00:16.680 | Initially, I had planned for at least a 45-minute session,
00:00:19.600 | so I left some reading material for you at the end.
00:00:22.440 | And all of the resources, you could download slides
00:00:25.160 | and everything, so feel free to take screenshots or not.
00:00:27.920 | And so I work at NVIDIA, I'm a solutions architect,
00:00:30.720 | so I work primarily with retail clients,
00:00:32.440 | and it's my job to essentially work with those clients,
00:00:35.160 | understand sort of what their main challenges are.
00:00:38.080 | This is data processing, computer vision,
00:00:40.560 | across all of the different use cases,
00:00:42.720 | and then now I'm focused on LLM inference.
00:00:45.360 | So my hope today is that you get a better intuition
00:00:49.480 | of exactly what's happening with this particular workload
00:00:52.160 | and how you go about, to some degree, sizing things,
00:00:55.200 | choosing different GPUs, et cetera, et cetera,
00:00:57.680 | and more importantly, controlling the cost of a deployment,
00:01:00.000 | 'cause that's oftentimes the thing
00:01:01.600 | that's really going to prevent you
00:01:03.600 | from taking this to any meaningful scale:
00:01:07.360 | that overall cost of a deployment.
00:01:09.680 | Most folks that I've seen are doing some kind of hybrid,
00:01:12.280 | so you choose a big box API,
00:01:14.080 | you have some set of queries that go there.
00:01:16.040 | In addition, you have some set of queries
00:01:18.080 | that go to some open-source hosted model
00:01:21.080 | or some fine-tuned model that you have internally.
00:01:25.160 | So just for reference, if you go to build.nvidia.com or ai.nvidia.com,
00:01:30.080 | everyone can get 1,000 inference requests for free.
00:01:33.040 | So I typically recommend this to folks who are benchmarking
00:01:36.960 | different types of open-source models.
00:01:38.680 | We have all of those models hosted.
00:01:40.480 | It's optimized.
00:01:41.400 | If you're teaching a course and you are trying to evaluate
00:01:44.680 | all of the different LLMs that are out there for your business,
00:01:47.640 | there are also multi-modal LLMs, speech LLMs.
00:01:51.040 | Every model that NVIDIA accelerates will be available there for you.
00:01:55.520 | And that's sort of a path for you to either go optimize them yourselves
00:01:59.280 | or to work with us.
00:02:00.560 | You'll see things about NVIDIA inference microservice
00:02:03.600 | and all of those things that you can take to enterprise.
00:02:06.280 | So we have sometimes the, I'll call it the rocky road,
00:02:10.360 | and then there's smooth roads.
00:02:11.200 | Whatever path you want to take, we're here to support you.
00:02:15.200 | In terms of agenda, very simple.
00:02:17.040 | I want you to understand the LLM inference workload,
00:02:19.960 | and then we'll move to how you go about measuring a production deployment
00:02:24.200 | and some of the things you need to be watching.
00:02:26.200 | It's a little more than, let's say, you know, the total time to generation
00:02:30.080 | and really understanding what's happening on the GPUs as you sort of scale out.
00:02:34.160 | Even if you have a single GPU,
00:02:35.720 | I think it's very important for you to just have that intuition.
00:02:38.480 | And then lastly, I'll show you some software that you can use,
00:02:41.840 | some open source packages that you can use,
00:02:43.480 | and then point to some paid offerings.
00:02:46.400 | Okay, we're going to get into the LLM inference workload itself.
00:02:50.360 | So the first part is really understanding what happens
00:02:53.320 | when you send a prompt onto the GPU.
00:02:56.200 | So I have this example here.
00:02:57.200 | I'm saying, okay, write me a presentation so I sound smart.
00:02:59.840 | I come to the AI engineer conference,
00:03:01.840 | and you guys are maybe going to like the talk.
00:03:04.160 | And essentially what I'm going to do is I'm going to put that on the GPU.
00:03:06.640 | So the moment that I send that prompt on the GPU, it stays on the GPU.
00:03:10.520 | So think about that.
00:03:11.800 | And then from there, I'm going to generate one token at a time.
00:03:14.680 | So I'm generating the tokens, LLM inference is hard,
00:03:18.160 | and I put the timestamps T1 through T4.
00:03:21.240 | So in every single deployment,
00:03:22.720 | no matter how fast anyone claims they're doing things,
00:03:26.320 | it's typically one token that's generated at a time.
00:03:29.360 | That's very important to understand.
00:03:31.320 | The next thing is that in order for an LLM to give you a coherent answer,
00:03:35.800 | just like how you speak,
00:03:37.400 | you have to remember every single thing that you said before,
00:03:40.240 | and you'll see the mechanism of how LLMs are able to do that.
00:03:45.240 | So that's why I'm putting LLM inference is in red and putting that back onto the GPU.
00:03:50.240 | So every token that I generate gets locked onto the GPU, and then you'll actually see what that looks like in terms of vectors.
00:03:56.680 | How many of you have heard of the KV cache before?
00:04:01.120 | Okay, some of you.
00:04:03.120 | Typically, I don't see many leaders hearing about this thing called the KV cache.
00:04:07.120 | The KV cache is the thing that really drives, to some degree, the cost.
00:04:11.640 | So whether or not you use some big box API or you're using a single GPU,
00:04:17.080 | it's all the same sort of mechanisms, the same algorithm that everyone is trying to solve.
00:04:21.080 | So in terms of steps, here I like to, as I said, sharpen your intuition.
00:04:27.880 | So the first thing, if we move from the left, my first job is to convert these texts,
00:04:31.760 | whatever texts that you send, we're going to focus on LLM inference,
00:04:35.320 | into some words that the model understands.
00:04:37.800 | So the model will have its own vocabulary, and it's my job to translate that.
00:04:41.400 | I'll give you the technical terms coming up after that.
00:04:44.600 | And the first thing that happens is I do some initial prompt processing.
00:04:48.200 | So I have to compute the attention mechanism on the entire prompt.
00:04:52.600 | I repeat that.
00:04:53.640 | I have to compute the attention mechanism on the entire prompt per user.
00:04:57.800 | So if I have a million people hitting my service and a million people send 10,000 tokens,
00:05:02.840 | that's a million times 10,000 attention mechanisms that I need to compute,
00:05:07.560 | also while generating tokens for other people.
00:05:10.360 | So it's good for you to appreciate sort of that complexity that's happening.
00:05:14.600 | And once I finish processing that prompt,
00:05:16.760 | then I'm going to start generating one token at a time.
00:05:19.400 | And that typically happens very fast.
00:05:21.000 | And then from there, every token that gets generated, that's in the LLM's vocabulary,
00:05:26.360 | I need to now de-tokenize that back into your language.
00:05:30.200 | So here's the technical terms that you'll see when you read the literature,
00:05:34.040 | or you read super technical documents.
00:05:36.680 | First is tokenization.
00:05:38.280 | Each model will have its own tokenizer.
00:05:40.360 | And the thing to think about, when you think of tokenizers, when they did pre-training,
00:05:46.760 | they downloaded the internet and then some, right?
00:05:49.960 | And they cleaned it up, et cetera, et cetera.
00:05:51.800 | So with the tokenizer, as you start thinking of the complexity across languages, coding languages,
00:05:56.840 | regions, et cetera, et cetera, they tried to get what is the minimal set of character groups that can
00:06:04.600 | represent this entire training data set efficiently.
00:06:07.240 | Because it's really all about efficiency.
00:06:09.400 | So for instance, the Llama tokenizer has 128,000 tokens, all right?
00:06:13.880 | And I'll talk a bit more about that.
00:06:15.160 | So here's what it actually kind of looks like on a GPU.
00:06:18.200 | So I tokenize, the LLM understands it.
00:06:20.680 | I go into this thing called pre-fill.
00:06:22.360 | Pre-fill is the stage where you compute the attention mechanism.
00:06:25.400 | And many people are doing advancements with attention mechanisms.
00:06:29.000 | I'll talk a bit more about that.
00:06:30.520 | So there are tons of different schemes.
00:06:32.360 | People leverage all of the different types of memory hierarchies in GPUs to really accelerate this type of workload.
00:06:39.800 | And then I start generating tokens one at a time.
00:06:42.520 | The red and the green just signify, hey, I'm storing those tokens on the GPU.
00:06:46.760 | The green is the latest one that I sent out.
00:06:49.160 | So hopefully that makes sort of intuitive sense.
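To make the prefill-then-decode loop concrete, here is a minimal sketch of one-token-at-a-time greedy generation using the Hugging Face Transformers API. The model name and prompt are placeholders (the model may require access), and a real serving stack would batch this across many requests and manage the cache far more carefully; this just shows the shape of the loop the speaker describes.

```python
# Minimal sketch of prefill + one-token-at-a-time decode with a KV cache.
# Model name and prompt are placeholders; real servers batch many requests.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example model, may require access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write me a presentation so I sound smart"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Prefill: attention is computed over the whole prompt once; K/V get cached.
out = model(prompt_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick

generated = [next_token]
for _ in range(31):  # decode: one new token per step, reusing the cache
    out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```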
00:06:51.400 | All right.
00:06:52.680 | So the other thing I want you to visualize, I think it's nice to visualize what is the actual data that sits on the GPU.
00:07:01.720 | So first, a token is approximately four characters.
00:07:04.840 | That's a nice way for you to think about it.
00:07:07.160 | So from here, I have two vectors.
00:07:10.680 | So the first vector is just showing token one through token V.
00:07:14.200 | V is the total number of tokens that I have in my tokenizer.
00:07:16.920 | And the second vector below is just I have some numeric index.
00:07:21.640 | I don't want to keep using the token to reference itself.
00:07:24.520 | I just use the number as a lookup.
00:07:26.760 | So my job when a prompt comes in is to convert that text into those token, what I'm going to call token IDs.
00:07:35.400 | So I have make me sound smart and make me sound smarter.
00:07:38.520 | You see two vector sets of tokens.
00:07:40.680 | And the key thing I want you to walk away from that distinction is that an LLM token is not a human word.
00:07:48.040 | Sometimes it is.
00:07:48.920 | Sometimes it's not.
00:07:49.720 | It's typically some sub-parts of words.
00:07:51.720 | You'll see weird symbols when you look at tokenizers from different models.
00:07:55.960 | But you want that first framing.
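To see that text-to-token-ID mapping for yourself, here is a small hedged sketch with a Hugging Face tokenizer; the model name is just an example (and may require access), and the exact IDs and sub-word splits depend entirely on which model's tokenizer you load.

```python
# Sketch: text -> token IDs with a model-specific tokenizer.
# The model name is only an example; every model ships its own vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

for text in ["make me sound smart", "make me sound smarter"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    pieces = tokenizer.convert_ids_to_tokens(ids)  # sub-word pieces, not human words
    print(text, "->", ids, pieces)

print("vocab size:", tokenizer.vocab_size)  # on the order of 128k for a Llama 3 style tokenizer
```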
00:07:57.320 | So now we have text.
00:07:58.440 | We went to a vector.
00:07:59.400 | So from there, each one of those LLM tokens had a corresponding embedding vector.
00:08:06.920 | So embedding vector is everything.
00:08:08.280 | We embed videos.
00:08:09.960 | We embed images.
00:08:11.000 | We embed text tokens.
00:08:13.000 | Think of it as a representation that an LLM can use to compare things and do math on.
00:08:19.160 | So that's why we always want to convert into some vector representation.
00:08:23.960 | Because some vector representation is just some high dimensional coordinate space.
00:08:28.120 | And we're just rearranging objects.
00:08:30.280 | That's to some degree what you're doing.
00:08:31.720 | Okay.
00:08:33.160 | So from those token IDs, I went to the actual embedding vectors themselves.
00:08:38.920 | So if you look, make me sound smart, now becomes a matrix.
00:08:41.720 | All right.
00:08:42.760 | Make me sound smarter becomes a matrix with an extra column.
00:08:46.120 | So in reality, what you're doing every time you submit a prompt,
00:08:49.800 | I don't care what LLM you submit it to, who you submit it to, this is what you're doing.
00:08:54.440 | All right.
00:08:55.240 | You are converting your text, now images as well.
00:08:58.440 | They get converted to some IDs or tokens or something like that.
00:09:01.640 | That'll be another interesting talk to do diffusion models, et cetera, et cetera.
00:09:05.640 | But you're really putting this large matrix on the GPU.
00:09:09.880 | So the next question you should ask is, okay, why are GPUs good for this workload?
00:09:14.520 | Because they process matrices really, really fast.
00:09:17.320 | So that's sort of the advantage and the thing, hopefully that makes a lot more sense to you.
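Here is a tiny sketch of that token-IDs-to-matrix step using a randomly initialized embedding table; the vocabulary size, hidden dimension, and token IDs below are made up for illustration, since in a real model the table is part of the trained weights.

```python
# Sketch: token IDs -> embedding vectors -> a (sequence_length x hidden_dim) matrix.
# Sizes and IDs are illustrative, not tied to any real model.
import torch

vocab_size, hidden_dim = 128_000, 4096
embedding = torch.nn.Embedding(vocab_size, hidden_dim)  # the model's token embedding table

token_ids = torch.tensor([[2543, 757, 5222, 7941]])             # "make me sound smart" (made-up IDs)
token_ids_longer = torch.tensor([[2543, 757, 5222, 7941, 261]])  # one extra token

x = embedding(token_ids)                 # shape: (1, 4, 4096)
x_longer = embedding(token_ids_longer)   # shape: (1, 5, 4096) -- one extra row of work

print(x.shape, x_longer.shape)
```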
00:09:22.360 | Now, the next thing I want to talk about is how the LLM is going to process these tokens.
00:09:27.320 | And I'll keep in mind if any, well, I'm not even going to ask you to raise your hand.
00:09:31.320 | I'm 100% sure each of you has used an LLM.
00:09:33.640 | If you have not, I'm not sure what's happening.
00:09:38.200 | The other one is the attention mechanism.
00:09:39.800 | I truly think it's one of the things that you should understand.
00:09:45.480 | If we ever drift away from it, that's fine.
00:09:47.800 | But the fundamentals of that mechanism and seeing sort of the innovations around that,
00:09:52.840 | I think, can help anyone, any business leader, et cetera, et cetera,
00:09:56.040 | just because you are able to speak a different kind of language in this generative future.
00:10:01.320 | So as you think of the attention mechanism, the intuition that you should have is just
00:10:05.240 | that mechanism of relating tokens.
00:10:07.480 | How do I distinguish in a sentence what is important?
00:10:10.360 | And then for the next token that's going to be generated, hey, what tokens that I said before
00:10:16.120 | were really important for me to make a good decision for that next token?
00:10:20.120 | So that's the intuition.
00:10:21.960 | And now we're going to -- we won't necessarily touch too much of the math,
00:10:24.520 | but I want you to see sort of what's happening on the GPU.
00:10:27.720 | So once again, the prompt comes in.
00:10:29.240 | I'm just going to do a short one, make me sound smart.
00:10:31.800 | I'm going to generate this token called LLM, all right?
00:10:34.840 | We saw these same matrices that I said before.
00:10:37.960 | So remember, my text now turns into a matrix hitting onto the GPU.
00:10:41.960 | And the main thing I want you to understand or visualize here is actually how
00:10:48.520 | an LLM's memory works.
00:10:49.720 | So now when you're speaking, you've recorded everything that I've said for the last
00:10:54.680 | 10 minutes in your brain, somewhere it's stored.
00:10:57.960 | So now you're going to see how the LLM is storing what it is that you just said.
00:11:01.320 | So from there, a lot of folks will hear about these query key and value matrices.
00:11:06.840 | This is what the actual model weights look like.
00:11:09.080 | So when you look at a model weights file, if you go on Hugging Face, there's typically a JSON file
00:11:13.720 | that will show you all of the different pieces of model files.
00:11:16.920 | And you'll see this thing called Q, K, and V. So I have these model weights.
00:11:20.920 | So now I've gone from text to a matrix.
00:11:23.720 | I'm going to matrix multiply against the weights of the model.
00:11:26.440 | So now I get these three output matrices.
00:11:29.320 | So think of these weight matrices that I showed here.
00:11:32.520 | Think as -- when you're doing a projection, what you're doing is you're taking
00:11:37.000 | some coordinates and you're putting it into a different space.
00:11:40.360 | That's really what you're doing when you do vector matrix math.
00:11:43.160 | So now when I do this matrix multiplication, I get these query, key, and value matrices -- so if you look at
00:11:48.840 | different tutorials on attention, you'll see these things pop up a lot.
00:11:52.280 | So hopefully that will help you read them a lot more easily.
00:11:54.440 | This is now the LLM's interpretation of each of those tokens that you sent in.
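For those who want to see the shape of that computation, here is a hedged single-head sketch: project the prompt matrix into Q, K, and V, then relate tokens with scaled dot-product attention. Real models add many heads, a causal mask, positional encodings, and fused GPU kernels, all omitted here, and the dimensions are illustrative.

```python
# Single-head attention sketch: project the input matrix X into Q, K, V,
# then relate tokens with softmax(Q K^T / sqrt(d)) V. Dimensions are illustrative.
import torch
import torch.nn.functional as F

seq_len, hidden_dim, head_dim = 4, 4096, 128
X = torch.randn(seq_len, hidden_dim)     # embedded prompt, e.g. "make me sound smart"

W_q = torch.randn(hidden_dim, head_dim)  # these weight matrices live in the model file
W_k = torch.randn(hidden_dim, head_dim)
W_v = torch.randn(hidden_dim, head_dim)

Q = X @ W_q                              # (seq_len, head_dim)
K = X @ W_k                              # cached on the GPU -> the "K" of the KV cache
V = X @ W_v                              # cached on the GPU -> the "V" of the KV cache

scores = (Q @ K.T) / head_dim ** 0.5     # how strongly each token attends to the others
attn = F.softmax(scores, dim=-1) @ V     # (seq_len, head_dim), fed to the rest of the layer
print(attn.shape)
```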
00:11:59.960 | And now the job is how do I take these query, key, and value matrices and sort of interpret them
00:12:07.240 | to try to generate the next best token. And this is just happening constantly, over and over, for every single
00:12:12.920 | token that's generated. But the key thing I want you to walk away with on this slide is where I drew the key
00:12:17.640 | and the value. Right? When people talk about KV cache optimization, every LLM performance engineer is just
00:12:24.600 | literally trying to make that thing as fast and small as possible. And that will make a little more sense
00:12:30.440 | as to what that does to your cost. But ultimately, these key and value matrices, this is like your LLM's
00:12:35.720 | memory. So it will make a little more sense coming up. I know I didn't show a ton of the math. I show
00:12:39.960 | some tutorials afterwards so you can go read more about that. My intention here is for you to visualize
00:12:45.320 | key and value. So every time you see a prompt, I just want you to be thinking, crap, key and value are on
00:12:49.720 | my GPU. Okay? The next. So here's the real value of the KV cache. So remember we said that whenever I
00:13:00.120 | generate a token, I'm going to push it back into the GPU. Right? So every token I generate, it goes back
00:13:05.560 | into the GPU. And then I have to compute an attention mechanism. So this is what's happening. This new
00:13:10.200 | token I generated, LLM, I get its vector representation, as you see in blue. But now I do vector-matrix math.
00:13:18.920 | So before, I did matrix-matrix math; that's when my prompt first comes in. I generated my first token.
00:13:26.120 | Now I'm doing vector matrix math. You know, people will batch this across all requests, but I'm just showing you
00:13:32.360 | a single request so you can see it. Now, the value of the KV cache is, if I were to, if I didn't have the KV
00:13:40.840 | cache, I would have to reprocess all of that work I did on the prompt before. So this is the benefit of your
00:13:48.120 | KV cache. Now I'm just going to compute attention on this newest token. How does this new token relate
00:13:53.400 | to everything that I said before? That's the thing that's really happening intuitively. So if I have
00:13:59.320 | this KV cache, my generation is going to be fast. And it's really up to what's called the batch manager
00:14:06.520 | on the GPU to make sure that I'm just pushing out as many tokens as possible. Okay. So if you look at
00:14:12.920 | an LLM, these groups of three matrices are called an attention head. There are more matrices than that,
00:14:18.520 | but these are the main ones. Llama has 32 attention heads. So I just kind of want you to appreciate
00:14:23.800 | what an LLM really looks like. All right. So I have 32 sets of these matrices. I have 32 of those KV
00:14:30.200 | caches happening at the same time. And now I have to combine all of that to then generate the next
00:14:34.840 | token. So there's an incredible amount of work that happens in a very short space of time to give you a
00:14:40.840 | coherent token. Okay. A good mental model for you to keep in your head -- I'm going to speed up a little bit -- is
00:14:47.000 | to -- if you see the number of parameters, multiply that by two, and that is your FP16 gigabyte memory on
00:14:54.120 | the GPU. So if you have, let's say, an L4, which I think is 20 gigs, and I have a Llama 8B, that's automatically
00:15:01.800 | 16 gigs FP16. So I only have four gigs left for my KV cache. So on the GPU, it's either the model weights
00:15:09.480 | or tokens. That's it. There's nothing else on the GPU. And I have a thing to read on that. This is a
00:15:14.760 | really good blog. It shows you all of the different optimizations that you can do.
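Here is that mental model as a hedged back-of-the-envelope calculation: roughly two bytes per parameter for FP16 weights, plus a per-token KV cache cost derived from the model shape. The Llama-8B-style numbers and the 24 GB GPU below are approximations of my own, and the formula ignores activations and framework overhead.

```python
# Back-of-the-envelope GPU memory math for serving one model (a sketch, not exact).
def weight_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """FP16 weights: ~2 bytes per parameter."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """One K and one V vector per layer per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-3-8B-ish shape (approximate): 32 layers, 8 KV heads (GQA), head_dim 128.
weights = weight_gb(8)                                # ~16 GB in FP16
per_token = kv_bytes_per_token(32, 8, 128)            # ~128 KB of KV cache per token
gpu_gb = 24                                           # e.g. an L4-class GPU; adjust for your card
budget_tokens = (gpu_gb - weights) * 1e9 / per_token  # rough KV-cache token budget

print(f"weights ~{weights:.0f} GB, KV cache ~{per_token / 1024:.0f} KB/token, "
      f"~{budget_tokens:,.0f} tokens of KV cache fit")
```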
00:15:19.720 | Okay. Now let's talk about measuring. So if you ever see this thing called ISL or -- do I have it
00:15:27.000 | there? Oh, sorry. ISL or OSL, that's input sequence length, output sequence length. So now I want you to
00:15:33.560 | see what some advanced monitoring might look like. If any of you are DevOps folks, these are things that
00:15:37.880 | you want to record. The first thing that we measure is time to first token. So how long does it take me to
00:15:42.920 | process the prompt and then generate my first token? And that's typically a measurement
00:15:49.000 | of how good your attention mechanism processing is. That's really what you're trying to suss out.
00:15:54.840 | So that's time to first token. Inter-token latencies. So after I've generated my first token,
00:15:59.880 | every single token after that, I'm looking at those individual spaces. So everything that's going to
00:16:04.680 | happen there, think about when the system is under load. I have, you know, a thousand requests coming
00:16:09.800 | into my system, I'm generating a thousand sets of different tokens. And the more memory I occupy,
00:16:14.760 | typically that slows down processing. So if you start to see drift in this metric, then -- so I'll show
00:16:19.800 | you some plots that you can look at. And then time to total generation. How long did it take me to
00:16:24.280 | initially get the prompt, fully finish the answer. All right? Super intuitive. Like I said, ISL,
00:16:29.880 | OSL, that's all that means when you see them on the plots coming up. Okay. This is a very important
00:16:36.520 | paradigm for you to understand in your mind. So I worked with a lot of folks on, you know, maybe
00:16:41.880 | RecSys deployments or deployments of other types of models. So on the GPU, if you're only deploying one
00:16:47.640 | model on a GPU, outside of LLM inference, in my opinion, I think you're wasting the GPU. You can put
00:16:52.920 | multiple models on the GPU to actually increase your throughput. That's why it was really created.
00:16:57.240 | So this is a slide -- excuse me. This figure is just showing I can have multiple models. I have some
00:17:03.800 | space for data and that's how I increase my throughput per unit hardware. However, on the LLM inference side,
00:17:09.720 | it's very different. I have one model. You know, folks can fit multiple models on a GPU. That's cool,
00:17:14.680 | but that's not a real production use case. You'll typically have a single model. The remaining space
00:17:19.400 | that you have is all for KV cache and generating all those tokens. So I just put four different
00:17:24.280 | requests and I just kind of want you to see the boxes that are happening. Okay. I would say this is
00:17:29.400 | the most important slide in the entire presentation because this is the thing that will determine
00:17:33.800 | both your cost and performance. So there are four different querying patterns that happen. And this
00:17:39.720 | is something that you must measure in your deployment because oftentimes you might read benchmarks and
00:17:45.000 | just say, all right, they'll cherry pick one or two of these. But in reality, in your production
00:17:49.880 | system, you might have several of these different patterns that are occurring. So let's take a look
00:17:54.520 | at the first one. Long input, short output. So long input means it's going to take me technically longer
00:18:01.160 | to compute the attention mechanism. So my prefill stage will be longer. It occupies more memory from
00:18:06.760 | my prompt. Does that make sense intuitively? Hopefully it's grabbing you. But then on the generation side,
00:18:12.360 | I don't generate many tokens. So there's not much -- those tokens are not taking up a lot of memory.
00:18:17.400 | And they will tend to finish fast. So the second one, or maybe the most costly use case is,
00:18:23.560 | so I have clients that will message me and say, hey, my data scientists are putting too-big prompts
00:18:28.200 | on my GPUs. So now they're killing my deployment. Because if everyone went and put the maximum context
00:18:33.800 | length, I can only fit so many requests on the GPU. So that's something for you to think about. You'll have to
00:18:38.920 | manage that internally with your deployments. So that's why I'm putting, you know, okay, the GPU is
00:18:43.960 | really full. Because long text -- excuse me, long input, long output. The next one, short, long,
00:18:49.080 | you know, your time to first token will be really fast. I don't have much to compute the attention mechanism
00:18:53.880 | on. But hey, I'm generating a ton of tokens. That's really, really fast. So hopefully, as you start
00:18:59.080 | measuring these types of different query patterns, you'll see different results. I just put, you know, what a random
00:19:05.800 | sampling set might actually look like on the GPU. Because not everyone will send the same length of
00:19:10.600 | input and output. So that will -- it'll be good for you to just sort of visualize and track these
00:19:15.640 | statistics. More importantly, why we're doing that internally -- I'm going to steal the time here, Peter.
00:19:21.400 | More importantly, why we're doing that or why we're tracking these things is that the whole goal is to
00:19:27.160 | build -- I have a big model. My goal is to shrink it as much as I can, but to keep it as accurate as possible.
00:19:33.960 | So the more that I shrink, the faster it runs, the more GPU memory I have for what?
00:19:38.840 | Tokens. All right? So that's how you really try to improve your cost. This is why I'm sort of
00:19:44.680 | proposing to you to build inference engines. So all I'm showing here is a 2D histogram of
00:19:49.880 | input sequence length versus output sequence length. Because the question that you'll have to answer is,
00:19:54.600 | hey, how long are my actual prompts? Someone might say, okay, here's the max prompt length that you can
00:20:00.760 | ingest, and the max prompt you can out -- excuse me, get on the output. And all of the big box model
00:20:07.000 | providers have to estimate this when they go into costing or providing a service to you, right?
00:20:12.200 | Because they have to host all of that machinery under the hood now that you understand what's happening.
00:20:17.640 | So we use this to statistically determine what is the max input sequence length and the max output
00:20:24.280 | sequence length across all of my users. And this will give you a really good indication of how you can
00:20:29.880 | size your engines. We use that to actually build more optimized engines. In addition, it will just
00:20:35.960 | give you a good view as to, what do you call it, scaling out and things like that. The next one is
00:20:41.800 | time-to-first token analysis. Remember, time-to-first token is measuring my performance of the attention
00:20:47.080 | mechanism under load. So someone might show the attention mechanism at one query. Show me
00:20:53.240 | the attention mechanism under load. When this thing is fully maxed out 24/7, that's when you really need
00:20:58.840 | to start measuring these types of things. So this is something you can look at. These are sort of
00:21:02.520 | experimental plots. There's a package called GenAI Perf that will be released open source. It's out
00:21:08.520 | already. I have a link to it there. This is where it will generate these plots for you. But I'm just
00:21:13.560 | showing you what the engineers are looking at internally to measure the performance of the
00:21:18.040 | compute platform. Next, time to completion analysis. How long did it take me to go from start to finish
00:21:23.320 | across every single request. Naturally, the wider that box plot, you have to intuitively ask what's
00:21:30.360 | happening. Why did this person's prompt take longer than another? So you can investigate either batching
00:21:36.040 | issues, scheduling issues, different things like that. I'll take questions in the end. Oh, I have to move
00:21:40.920 | really fast. Sorry there, Peter. Okay. I'm going to speed up here. Token to token latency. Peter,
00:21:45.880 | how much time I got? Oh, you're fine. We'll definitely have time for the question.
00:21:49.400 | Okay, cool. I'm going to steal. I'm definitely over. Sorry. I realize I may have gone a little too
00:21:54.360 | fast. So forgive me for that. No, you have five minutes. Cool. All right. Token to token latency.
00:21:59.880 | So that is, I'm generating tokens. I'm looking at that spacing versus token position. So the longer a
00:22:06.920 | sequence gets, remember, my memory grows. So typically that means that system is under more
00:22:11.880 | load. It has more throttling that might happen under high load of requests. So if I see a large
00:22:18.280 | variation in token to token latency as the sequence gets longer when I'm generating, that means I'm not
00:22:25.160 | very performant. All right. So we look at that to see -- I try to make sure that that's constant,
00:22:29.720 | no matter how many tokens I'm generating. That means I'm really efficient. Okay.
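As a hedged illustration of how you might compute these metrics yourself, here is a sketch that derives time to first token, inter-token latencies, and total generation time from per-token arrival timestamps recorded client-side while streaming; the helper and its field names are made up for illustration, and tools like GenAI-Perf will produce these measurements and plots for you.

```python
# Sketch: deriving TTFT, inter-token latency, and total generation time
# from timestamps recorded while streaming a response. Field names are illustrative.
import statistics

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """token_times are wall-clock arrival times of each streamed token, in seconds."""
    ttft = token_times[0] - request_start                        # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    return {
        "ttft_s": ttft,
        "itl_mean_s": statistics.mean(itl) if itl else 0.0,
        "itl_p95_s": statistics.quantiles(itl, n=20)[-1] if len(itl) >= 20 else max(itl, default=0.0),
        "total_s": token_times[-1] - request_start,              # time to total generation
        "output_tokens": len(token_times),
    }

# Example with fake timestamps: prefill ~120 ms, then ~20 ms per decoded token.
start = 0.0
times = [0.12 + 0.02 * i for i in range(50)]
print(latency_metrics(start, times))
```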
00:22:34.600 | Last one would be time to first token versus number of input tokens. So time to first token,
00:22:41.480 | remember, is computing the attention mechanism, okay, versus number of input tokens. So if I have a bigger
00:22:47.480 | prompt, my attention computation will take longer. But if that plot goes up, like, from your perspective,
00:22:53.400 | it goes up like this in terms of sequence length, that's not really good performance. We really look at
00:22:58.120 | that slope and we try to get that slope almost, you know, as low as possible. So if you send me this
00:23:04.040 | long sequence, I can get that thing done really fast. Okay. Okay. In terms of software, you will see
00:23:10.520 | this thing called TRT LLM. Triton is an open source inference server. So you can deploy models on CPU,
00:23:17.000 | on GPU, computer vision, RecSys, Python, PyTorch, TensorFlow. It will host all of the different types of models.
00:23:23.960 | So there's one way that your deployment team deploys. All their data scientists are happy
00:23:28.040 | because they don't have to do conversion. You're happy as a deployment person because you don't
00:23:31.720 | have to manage a TorchServe versus TFServe and Flask and all of it is done through one. It's written
00:23:37.560 | in C++, blazingly fast. And then the other thing: you'll see a lot more coming out
00:23:42.600 | of NVIDIA's NVIDIA inference microservice because building these engines, getting them deployed,
00:23:47.240 | optimized at scale, it's not easy. So we've sort of made that easy for you as an enterprise offering,
00:23:51.560 | but you guys can try it out for free. Okay. So TRT LLM, let me just give you lots of stuff
00:23:57.800 | on this slide. But the main thing I want you to walk away with is this is the model compilation package
00:24:03.560 | for LLMs on NVIDIA GPUs. If you want to get best performance from NVIDIA GPUs, please make sure
00:24:09.240 | you use TRT LLM. Naturally, as we're investing more in NIM, you'll see some more things come out.
00:24:15.560 | So you'll see performances on A100 and H100. Really focus on FP8 GPUs. So FP8 will be Hopper and Ada Lovelace.
00:24:24.520 | Okay. So FP8, I'll talk a bit more about that, what the advantage there is. But mainly is if I go from FP16,
00:24:31.720 | FP8 is this. Half my memory. Almost the same accuracy. And so we measure the accuracy and we publish the
00:24:38.200 | accuracy. So now I have this much more space for tokens. But more importantly, this model is that much
00:24:43.480 | faster. Okay. So I want you to understand where the sort of industry is going. This is why Hopper,
00:24:48.600 | the world ate Hopper for breakfast and lunch and dinner because of FP8. It gave folks that
00:24:54.120 | cost benefit to do this thing a lot faster. Okay. In-flight batching, it just means I don't have
00:25:00.840 | to wait for all the requests to finish to start a new request. The moment your request finishes,
00:25:05.640 | I can inject a new request while others are going. Okay. Tons of features here. I put the features.
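Here is a toy sketch of the in-flight (continuous) batching idea: a new request is admitted the moment a slot frees up, rather than waiting for the whole batch to drain. The scheduler and the stopping condition below are stand-ins of my own, not how TRT-LLM actually implements it.

```python
# Toy sketch of in-flight (continuous) batching: new requests join the batch as
# soon as a slot frees up, instead of waiting for the whole batch to finish.
# The "model" here is a stub that just decides when a sequence is done.
import random
from collections import deque

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(10))  # queued requests
active = {}                                    # request id -> tokens generated so far

step = 0
while waiting or active:
    # Admit new requests into any free slots (this is the "in-flight" part).
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = 0

    # One decode step generates one token for every active request.
    finished = []
    for req in active:
        active[req] += 1
        if random.random() < 0.1 or active[req] >= 64:  # stub stopping condition
            finished.append(req)

    for req in finished:
        print(f"step {step}: {req} finished after {active.pop(req)} tokens")
    step += 1
```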
00:25:11.720 | So some ones to focus on are quantized KV cache. So I can actually represent my KV cache in different
00:25:19.240 | precision. So that means I'm actively shrinking that memory,
00:25:23.720 | making it more performant. You have paged KV cache. That's just you managing your GPU memory a lot better
00:25:29.240 | in terms of all of that memory. So there are tons of things you can do.
00:25:32.600 | Tensor parallelism. The thing to remember about tensor parallelism: if you want to reduce latency,
00:25:37.720 | use tensor parallelism. Split the model up across multiple GPUs. That's typically done within a node.
00:25:44.440 | I repeat that. That's typically done within a node. You don't like to do tensor parallelism across a node.
00:25:49.960 | You'll see pipeline parallelism go across a node. Pipeline parallelism is more sequential,
00:25:54.760 | so I process this chunk. So in a multi-node model, like huge models, this box will finish and pass off to
00:26:01.960 | the next box. But most folks will typically just work -- most models will work within a single node.
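As a toy illustration of what tensor parallelism does to the math, here is a sketch that splits one weight matrix column-wise across two hypothetical devices and recombines the partial results; real frameworks shard across physical GPUs and use NCCL collectives over NVLink, which this single-process example only mimics.

```python
# Toy illustration of tensor parallelism: split one weight matrix column-wise
# across two "devices" and concatenate the partial outputs. Real frameworks do
# this across physical GPUs with NCCL collectives; this just mimics the math.
import torch

hidden_dim, ffn_dim = 4096, 14336
X = torch.randn(8, hidden_dim)          # activations for 8 tokens
W = torch.randn(hidden_dim, ffn_dim)    # a single large weight matrix

W_shard0, W_shard1 = W.chunk(2, dim=1)  # each GPU would hold one shard

partial0 = X @ W_shard0                 # computed on "GPU 0"
partial1 = X @ W_shard1                 # computed on "GPU 1"
combined = torch.cat([partial0, partial1], dim=1)

print(torch.allclose(combined, X @ W))  # True: same result, half the weights per device
```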
00:26:07.320 | So those are some of the things. In terms of models that you have access to, we optimize those models
00:26:22.280 | and we give you a lot of scripts where you can go do that on your own, or you can sort of take our
00:26:26.280 | software and take an easy path. Either way, we support you. So here are some of the models that
00:26:31.480 | are there. All of the Llamas, Mistral, Mixtral, we work with all those teams behind the scenes. So
00:26:36.440 | typically before any foundation model comes out, we work with those teams to get them deployed.
00:26:41.480 | Okay, what does it mean for TensorRT? So you might have seen TensorRT before, which was a deep learning
00:26:48.200 | compilation package for NVIDIA GPUs. Lots of folks in computer vision, et cetera, et cetera, have used that.
00:26:53.720 | We took the best practices from there and added all of the extra things that need to happen
00:26:58.440 | in the LLM inference loop. So that's what TRT LLM is really about. So mainly focus on LLM inference.
00:27:05.720 | Here's a good visual. An engine that's built to a specific GPU cannot be moved to another GPU. So you
00:27:12.840 | always have to compile to that GPU. That's why it's that performant, because we really leverage
00:27:18.440 | all of the actual hardware on that system to rewrite the algorithm, rewrite that model to that specific
00:27:24.680 | piece of hardware. Okay. TRT LLM and Triton. So TRT LLM will give me an inference engine. I need something
00:27:32.200 | to host that inference engine and accept requests, batching, et cetera, et cetera. So we have Triton.
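As one hedged example of what talking to that setup can look like, here is a client request in the style of the generate endpoint that Triton exposes for the TensorRT-LLM backend; the model name ("ensemble"), port, and JSON field names are assumptions borrowed from typical examples and will differ depending on how your model repository is configured.

```python
# Hedged sketch of a client request to a Triton server hosting a TensorRT-LLM engine.
# The model name ("ensemble"), port, and JSON field names are assumptions that
# depend on how your Triton model repository is configured.
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "Write me a presentation so I sound smart",
    "max_tokens": 128,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()
print(body.get("text_output", body))  # output field name may vary by backend config
```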
00:27:37.400 | Triton works very simply. It's literally a folder where you specify your model, and it works on tensor in and tensor out.
00:27:43.880 | So it will tell you what are my inputs coming in and out. And then it will basically understand
00:27:48.600 | how to interpret that file. Or you can host any other different models. That's a thing I do a lot
00:27:54.120 | with folks. Just two more slides. This is where the future of inference is going. So a lot of folks
00:27:58.920 | do FP16 inference today. A lot of folks are moving towards FP8 just because, hey, I now have half the model size,
00:28:08.040 | almost twice the speed, more space for tokens. It just makes more sense from a cost perspective. That's
00:28:13.560 | why folks like that. And then you saw Blackwell was announced. That's the major innovation. I get FP4.
00:28:19.480 | So that's where things are really going to get interesting. I'll end with NVIDIA inference microservice.
00:28:25.080 | So we've made this thing really easy. We've gone and actually found the best configurations for all of
00:28:29.720 | these models on each piece of GPU. And we're slowly rolling out all of the models because it, you know,
00:28:35.080 | will just take some time to optimize the world, essentially. And yeah, you can use this to
00:28:40.360 | download all the slides. I put papers. Tons of other things for you to read. So, yeah. Hopefully,
00:28:46.200 | your intuition has sharpened.
00:28:47.800 | Shall we just conclude with the, because if someone had a question.
00:28:56.440 | Sure. Yeah, I think.
00:28:57.160 | Where was the question?
00:28:58.040 | Yeah.
00:28:58.840 | Oh, hang on. I'm going to come over and point my mic at you.
00:29:01.640 | Thank you.
00:29:06.440 | So, hi. Sorry. My question is actually on the heat map that you shared.
00:29:11.880 | Yeah, yeah.
00:29:11.880 | Do you mind walking through the heat map and how to interpret it?
00:29:15.400 | Because it was a little small.
00:29:16.680 | Yeah, sorry about that. Yeah. So the heat map,
00:29:18.520 | Thanks.
00:29:18.920 | All I'm looking at is,
00:29:20.040 | um, so when you go to build an engine, you build an engine to the max input sequence length and the max
00:29:26.200 | output sequence length. So we actually change how that matrix math is happening under the hood based on
00:29:32.120 | those settings. So you might say, all right, my users are only going to send 4,000 tokens. But in reality,
00:30:38.680 | they might have been sending 1,300 over the past week that you measured. So now you can say with
00:30:43.720 | statistical certainty that, hey, for the majority of people that we're serving, um, during this time,
00:29:51.000 | these were the querying patterns. So I can rebuild an engine for that period of time. What gets super
00:29:55.960 | interesting, this is a topic I'm very interested in, is seasonal engines. So during the day, you have
00:30:00.920 | different querying patterns. So you'll scale down, you'll scale up. And so you might have different
00:30:05.880 | engines built for different types of querying patterns based on traffic and stuff like that.
00:30:10.520 | So hopefully that may have answered the question, yeah. But it's just saying, you know,
00:30:15.560 | looking at the bounds of what's the minimum number of tokens that came in, the max, uh, min, min out
00:30:21.240 | and max out, and just looking at that over the entire distribution. Yes, sir?
00:30:25.800 | Oh, well, yeah, right there. When it comes to those, uh,
00:30:31.240 | inference patterns you talked about, like LILO and LISO, um, how do you, what kind of strategies do
00:30:35.880 | you have to manage? Like which ones are used at, like, because obviously each session is going to be
00:30:40.920 | pretty generic. You don't know which one to use at first. Correct. Um, do you split those between GPUs
00:30:45.320 | or do you stick with one and does it switch between...? So typically we'll, we'll go to,
00:30:49.240 | you try to find what's one configuration that will manage the, the plethora of types of requests that
00:30:55.480 | you have coming in. So we, we're typically at a, a one engine per all the different querying types.
00:31:01.400 | And I think you'll start seeing, I'm giving you a little bit of future ways to think about it on the
00:31:05.720 | DevOps side because that's something you'll have to test, right? If I look at this querying pattern
00:31:11.320 | that came into my system with this engine, if I switch the engine, does it still satisfy the querying
00:31:16.440 | pattern? And how much cost does it save? How much faster is it? So that's more of a, an engineering
00:31:21.720 | exercise that you'll have to deploy. Sorry, I, I didn't have a...
00:31:24.120 | Yeah, yeah, so I, I just, I'm very interested in the seasonal side just because,
00:31:28.920 | okay, querying patterns will change. Um, especially when agents come, it'll just be,
00:31:33.880 | that's going to get super interesting when agents are just throwing stuff. Yes, sir?
00:31:37.720 | Um, so question about how you measure quality of attention. Um, is it, is it correct
00:31:43.720 | intuition to think that attention is a fundamentally scarce resource in the sense of, it's about paying
00:31:48.440 | attention to one thing at the expense of other contexts? So then how can you, like, scale attention
00:31:54.840 | mechanisms infinitely the way we can conduct? Yeah, so, so what people do in order to scale the attention
00:32:01.560 | mechanism is, here's another interesting fact that, um, why folks don't train huge context models,
00:32:07.720 | because it's actually, now you've seen, the bigger my, uh, prompt, the more memory I need. So imagine
00:32:13.400 | what that does to a huge, I don't know, 10,000, 100,000 GPU deployment. It might make it a million GPUs
00:32:20.360 | just to do that context length. So people will train the small context length and then interpolate
00:32:26.040 | in that value to, to give you that length of context length and then you're sort of bound to what
00:32:30.760 | attention mechanism you were using. Then there are things like FlashAttention that will
00:32:35.320 | just do everything in the L1 cache really, really fast. So it depends on the speed of some of the
00:32:41.000 | different, it also depends on the GPU as well. So that's why, um, if you look at Blackwell that was
00:32:46.120 | announced by Jensen, they literally have connected, I think, 72 different GPUs on one NVLink. So NVLink
00:32:54.040 | connects GPUs together, that's how we can move data insanely fast, and now we've connected like 72 GPUs
00:32:59.560 | on one. That's, that's just to show you, um, like mixture of experts trying to compute attention across
00:33:05.320 | all of these different things. But that's actually a really good question.
00:33:10.840 | No, I don't necessarily think so. Like the entire industry is, you know, going after that problem.
00:33:16.360 | That's why everybody wants to maybe see something other than attention and, ah, you know, there's so much
00:33:21.480 | excitement there. Yeah, unfortunately we have to call time on it now, but that's been fantastic. Thank you, Mo.