Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA

Chapters
0:00 Introduction to Breaking the Inference Pareto Frontier
0:33 Introduction of Kyle Kranen and NVIDIA Dynamo
1:31 The Three Pillars of Deployment (Quality, Latency, Cost)
2:11 Understanding the Pareto Frontier
3:06 Application-Specific Prioritization of Quality, Latency, and Cost
4:32 Common Techniques to Manipulate the Pareto Frontier (Quantization, RAG, Reasoning)
5:19 Compounding Techniques
6:04 Three Drivers for Modifying the Pareto Frontier (Scale, Structure, Dynamism)
6:20 Scale: Disaggregation
11:02 Scale: Routing
13:00 Structure: Inference Time Scaling
16:14 Structure: KV Manipulation
17:43 Dynamism: Worker Specialization
18:42 Dynamism: Dynamic Load Balancing
19:55 Conclusion and NVIDIA Dynamo Resources
Hey there, everyone. I'm Kyle Kranen, and today I'll be talking about how to break the inference Pareto frontier to your advantage. Really, the thing that enables success is that a good model plus a good system that takes into account the actual constraints of your deployment is key to the success of both the deployment and the application backed by it. So, who am I and why am I talking about this? As I said, my name is Kyle Kranen. I currently work at NVIDIA. Previously at NVIDIA, I was leading and GMing the largest inference deployment in the company, with a quarterly cloud bill in the multiple tens of millions of dollars. Now I'm an architect and lead for a project we just released as open source called NVIDIA Dynamo, which aims to enable data-center-scale inference: manipulating your deployment and the Pareto frontier in order to achieve better SLAs, or lower costs for existing SLAs, with techniques like disaggregation and others we'll talk about later in the talk. The Dynamo meetup is linked right here; you can learn more about Dynamo there if you want to look it up, and I'll also have that link at the end of the talk.
So, the three things I like to think about when I'm deciding whether or not something can actually be deployed and used are really simple. Quality: whether your application and the system around your model can complete tasks with some level of accuracy or quality. Latency: whether the task can be completed in a fast enough envelope for the user to be happy, or to meet safety guarantees, as in robotics. And cost: can the LLM complete the task cheaply enough per request for you to meet whatever margin requirements you have for your application?
One of the ways we generally compare these three things is through a Pareto frontier. The frontier I'm showing here is two-dimensional; it's really hard to plot things in 3D on a 2D slide, so I'm just going to show two dimensions. What this looks like is an edge that represents the best, top-and-rightmost points we can achieve for a specific set of attributes. In this case, we have TPS per GPU, which is effectively a cost metric: how many tokens can you serve per GPU per second? And user TPS, which is a responsiveness metric. So this is latency versus cost. For different applications, you actually only want one point on the Pareto frontier: what is your operating latency, what is the operating quality you need, and how can you minimize cost for that?
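To make the frontier itself concrete, here is a minimal sketch, entirely my own illustration with made-up numbers, of how you might extract the Pareto-optimal operating points from a set of measured deployment configurations:

```python
# Minimal sketch: extract a Pareto frontier from measured operating points.
# Each point is (user_tps, tps_per_gpu) for one deployment configuration;
# both axes are "higher is better". A point survives if no other point is
# at least as good on both axes and strictly better on one.

def pareto_frontier(points):
    frontier = []
    for (u, g) in points:
        dominated = any(
            (u2 >= u and g2 >= g) and (u2 > u or g2 > g)
            for (u2, g2) in points
        )
        if not dominated:
            frontier.append((u, g))
    return sorted(set(frontier))  # left-to-right along the user-TPS axis

if __name__ == "__main__":
    # Hypothetical measurements (user TPS, TPS per GPU) from different configs.
    measured = [(20, 900), (50, 700), (50, 650), (100, 400), (150, 180), (200, 150)]
    print(pareto_frontier(measured))
    # -> [(20, 900), (50, 700), (100, 400), (150, 180), (200, 150)]
```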
Now, this really depends on the application you're talking about. So one of the most important things you do when you're thinking about breaking the Pareto frontier is think about your application. For example, if we're talking about personalized cancer cures, a topic that comes up a lot in the context of generative AI, latency and cost are pretty much no object. You could spend millions of dollars proving out a single cure, and if it works, the return on investment is so high that it doesn't really matter. To take a different example, tab completion like you see in popular IDEs such as Cursor is very, very dependent on snappiness: the user expects that when they press tab, they will see a recommendation for the next line or the next set of tokens very quickly. And to take another code example, for async code commits, things like Cursor's agent mode and other applications where the agent works alongside rather than in front of the user, there isn't as much consideration for latency, but there is a concern for both quality and cost. This all comes down to what the user expects from the application. Do they expect it to be fast? Do they expect it to be slow? Are they in the loop with this application?
Now, there is a series of pretty commonly known techniques that all support manipulating this frontier. For example, quantization speeds up your latency and also decreases your cost, because you can run higher batch sizes. Retrieval-augmented generation generally slows down your application, raising latency and increasing cost, but it also increases quality. Reasoning is similar: you produce more tokens to think. And changing the model config lets you do any of these things; if you change how the model is parallelized, you can significantly change the characteristics of speed, cost, and, theoretically, quality if you're talking about non-haloed context parallelism. The thing I want to impart on you before we jump into more advanced techniques is that these techniques can be compounded. For example, if you have an initial application with some required performance, you can stack retrieval-augmented generation on it to increase quality at the expense of latency, and then stack quantization of the model on top of that to win the latency back. The point I'm trying to make is that you have a large toolbox of tools you can use together. The tools are not independent, and they can be combined in sometimes non-obvious ways to actually break your Pareto frontier, or squeeze it in different directions, to support your application.
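To make the compounding idea concrete, here is a toy sketch, entirely my own illustration with made-up multipliers rather than measured numbers, that treats each technique as an approximate multiplier on quality, latency, and cost and stacks them:

```python
# Toy model of compounding techniques (illustrative numbers only).
# Each technique is a (quality, latency, cost) multiplier relative to the
# baseline deployment; stacking techniques multiplies their effects.

from functools import reduce

TECHNIQUES = {
    # name: (quality_mult, latency_mult, cost_mult) -- hypothetical values
    "rag":          (1.15, 1.40, 1.30),   # better answers, slower, pricier
    "quantization": (0.99, 0.60, 0.55),   # tiny quality hit, much faster/cheaper
    "reasoning":    (1.20, 2.50, 2.50),   # more tokens to think
}

def stack(baseline, names):
    """Apply a list of techniques to a (quality, latency, cost) baseline."""
    def apply(state, name):
        q, l, c = state
        dq, dl, dc = TECHNIQUES[name]
        return (q * dq, l * dl, c * dc)
    return reduce(apply, names, baseline)

if __name__ == "__main__":
    baseline = (1.0, 1.0, 1.0)
    print(stack(baseline, ["rag"]))                  # quality up, latency/cost up
    print(stack(baseline, ["rag", "quantization"]))  # quantization wins latency back
```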
So, there are three things beyond those techniques that I tend to think drive how you can modify the Pareto frontier going forward. Those three are scale, structure, and dynamism. One of the things that is really relevant in the realm of scale is disaggregation. For those who aren't aware, KV caching is a technique in which you take the key and value vectors associated with each token and cache them, so that during autoregressive generation you don't have to recompute the key and value vectors for the entire sequence up to that point; you just compute new ones and append them to the KV cache. What this means is that we effectively have two phases of generation: one in which you're doing prefill, filling up your KV cache, and one in which you're actually generating new KV cache along with new tokens and producing output.
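To ground the two phases, here is a tiny single-head sketch, my own toy illustration rather than production code, showing prefill building the cache in one pass and decode appending one key/value row per new token:

```python
# Minimal sketch of the two phases of KV-cached generation (single head,
# toy random projections), just to make "prefill" vs. "decode" concrete.
import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)        # (seq,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # (d,)

def prefill(prompt_embs):
    """Phase 1: compute K/V for every prompt token in one compute-bound pass."""
    return prompt_embs @ Wk, prompt_embs @ Wv

def decode_step(x, K, V):
    """Phase 2: one new token -- append one K/V row, read the whole cache."""
    K = np.vstack([K, x @ Wk])
    V = np.vstack([V, x @ Wv])
    out = attend(x @ Wq, K, V)         # memory-bound: streams the whole cache
    return out, K, V

prompt = rng.standard_normal((8, d))   # 8 "prompt token" embeddings
K, V = prefill(prompt)
x = rng.standard_normal(d)             # next-token embedding
out, K, V = decode_step(x, K, V)
print(K.shape, V.shape, out.shape)     # (9, 16) (9, 16) (16,)
```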
Now, disaggregation as a technique basically lets you take these two phases, which were typically run on the same set of GPUs, and split them onto different workers and sets of GPUs. This provides a couple of key benefits. The big one is that you can now treat two phases with very different needs separately: prefill is very compute-bound, while decode, depending on the application and the model, can be very memory-bound. It also allows granular load matching between the two phases. What this means is that compute saturates relatively early; to use DeepSeek as an example, you may use relatively few GPUs for your prefill instances and have them handle a lower batch size, while handling a much larger batch size with many more GPUs on your decode instances. That split and the heterogeneity between the two lets you get far more performance. The other thing is that with in-flight batching, you have many requests arriving at the same time that are in different phases of generation. If a request doing prefill and a request doing decode land on the same machine, you get scheduling conflicts, and the scheduler has to decide whether or not it handles new tokens. There are techniques to handle this, like chunked prefill with piggybacking, but generally there is a cost to that mutual scheduling, so splitting the phases out makes scheduling simpler.
Now, there's an asterisk to this, but really quickly, let me go over the performance numbers. I'm going to use Llama 70B as an example. Here, on our Y axis, we have tokens per second per GPU; on our X axis, we have tokens per second per user. Up and to the right is better. If we choose one operating point for latency, then by disaggregating on the same number of GPUs, 16 total H100s, we can achieve up to two times the tokens per second per GPU at that fixed latency, which means you're now paying two times less for your application. There are some constraints, though. The use case really does dictate the performance of disaggregation. For example, low-input-length use cases see little to no speedup, because you don't have as much of the scheduling problem: they're very prefill-light, so you're basically just doing decode the entire time. And per the graph, disaggregation is usually most useful in the middle. In very high-latency, high-throughput scenarios (the top left of the graph) and low-latency, low-throughput scenarios (the bottom right), aggregated serving tends to reconverge with disaggregated and produce a little more performance. That said, for a lot of interactive applications, disaggregation makes the most sense, because users tend to care about things in the realm of 20 to 200 tokens per second. The other caveat is that configuration is really important. Since you're separating the two phases into prefill and decode, the balance between the number of prefill workers and decode workers dictates performance. If you have too many decode workers, you'll have decode workers starving for work; if you have too many prefill workers, they'll generate work faster than the decode workers can absorb it, and the decode workers get pushed down by ever-increasing load and queue depth. Modifying this balance is also expensive and hard, because the right prefill-to-decode ratio depends on the parallel configs of each, so it's a really wide configuration space.
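As a rough illustration of why that balance matters, here is a back-of-the-envelope sketch of my own, with assumed per-worker throughputs rather than anything from the talk, that estimates how many prefill workers you need per decode worker for a given traffic mix:

```python
# Back-of-the-envelope prefill/decode worker balance (illustrative only).
# Inputs are assumed measurements for your model, hardware, and parallel config.

def prefill_decode_ratio(avg_isl, avg_osl,
                         prefill_tok_per_s_per_worker,
                         decode_tok_per_s_per_worker):
    """Ratio of prefill workers to decode workers needed for this traffic mix."""
    # Each request needs avg_isl tokens of prefill and avg_osl tokens of decode.
    prefill_workers = avg_isl / prefill_tok_per_s_per_worker
    decode_workers = avg_osl / decode_tok_per_s_per_worker
    return prefill_workers / decode_workers

# Hypothetical numbers: prefill workers chew through 40k prompt tok/s,
# decode workers emit 4k generated tok/s.
print(prefill_decode_ratio(avg_isl=4_000, avg_osl=500,
                           prefill_tok_per_s_per_worker=40_000,
                           decode_tok_per_s_per_worker=4_000))
# ~0.8 prefill workers per decode worker; if traffic shifts toward longer
# prompts or shorter outputs, this ratio climbs and the pools must rebalance.
```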
One other thing we talk about with respect to scale is routing. We've talked about how this KV is important. One of the things we have to do for prefill-decode disaggregation is actually transfer the KV between machines, and in some sense there's an affinity for certain machines to do certain work, since the KV cache of previous requests is stored on those GPUs, or offloaded to host memory or external storage, during the course of inference. In the naive case, you route pretty much randomly: you're not biasing toward anything, you're just sampling uniformly. Alternatively, you could optimize purely for KV match, but with a purely KV-based router you may end up biasing toward machines that already have too high a KV load, can't handle the request, and leave you queueing. In the smart case, you want to optimize a cost function that combines the prefix match you can get from the work already done on a node with the amount of load that already exists on that node. And as you scale out and get more and more GPUs in a deployment, you end up with more and more of the KV space represented locally on those machines. Because of that, a larger and larger deployment means an asymptotically increasing KV cache hit rate, which means you're doing less and less prefill work over time. So routing, to give it a report card, improves your speed and your cost, and doesn't really affect quality, because the model is doing the same work it would normally do.
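Here is a minimal sketch of what such a cost function might look like, my own illustration rather than Dynamo's actual router, scoring each worker by the prefill tokens a prefix match would save minus a penalty for the load already on it:

```python
# Minimal KV-aware routing sketch (illustrative; not the Dynamo router).
from dataclasses import dataclass

BLOCK = 64  # tokens per KV block (assumed)

@dataclass
class Worker:
    name: str
    cached_prefixes: list        # list of block-hash tuples held in this worker's cache
    active_tokens: int           # rough proxy for current KV load
    capacity_tokens: int

def block_hashes(token_ids):
    """Hash the prompt block by block so prefixes can be compared cheaply."""
    return tuple(hash(tuple(token_ids[i:i + BLOCK])) for i in range(0, len(token_ids), BLOCK))

def prefix_match_blocks(request_hashes, worker):
    best = 0
    for cached in worker.cached_prefixes:
        n = 0
        for a, b in zip(request_hashes, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def route(token_ids, workers, load_weight=1.0):
    req = block_hashes(token_ids)
    def score(w):
        saved = prefix_match_blocks(req, w) * BLOCK          # prefill tokens saved
        load = w.active_tokens / w.capacity_tokens           # rough utilization
        return saved - load_weight * load * len(token_ids)   # trade match vs. load
    return max(workers, key=score)

if __name__ == "__main__":
    prompt = list(range(300))
    w1 = Worker("w1", [block_hashes(list(range(256)))], active_tokens=20_000, capacity_tokens=100_000)
    w2 = Worker("w2", [], active_tokens=1_000, capacity_tokens=100_000)
    print(route(prompt, [w1, w2]).name)  # w1 wins on prefix match despite higher load
```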
Now, let's talk about structure. Structure is really important because we have a lot of workloads, which you've probably seen here at the AI Engineer World's Fair, like agents. Agents impart structure on the workload in that they have moderately predictable usage patterns across concurrent requests. An example here is inference-time scaling. This is a cool graph I'll go over really quickly. We have three models: in green, an 8B model; in yellow, a 49B model; and in red, a 235B model. We find that with inference-time scaling, that is, re-querying the model and prompting it to reconsider its results or reason more about them, we can produce better and better results. And you see this really interesting trend where, with about three or four re-queries, the 8B model is basically on par in quality with the 49B model, and the 49B is almost on par in quality with the 235B model. We also note that the cost of querying that 8B model, even multiple times, is actually lower than querying the larger model. In this sense, inference-time scaling can be seen as increasing quality at the cost of speed and at the cost of cost, because you're re-querying. But alternatively, if you keep quality fixed, you can get lower latency and lower cost by using a smaller model and re-querying it multiple times. And the structure we infer from that re-querying lets us do better scheduling. In this graph, we have a series of curves representing the runtime of a given reasoning example from the Natural Plan dataset against concurrency, that is, how many concurrent instances you can run at once, and we sample across that concurrency. We see that implementing disaggregation gives a small benefit, mostly because this dataset is short-ISL, long-OSL, where you don't get a whole ton of benefit from disaggregation. Here, removing a round trip by making the re-queries come from the router instead of from the user or client on the outside really decreases the latency of those round trips. On top of that, making the router and the LLM scheduler aware that you are doing repeat work, that you're re-querying, gives an additional benefit, which is the jump from the red line to the green line. That is to say, across a wide variety of models, if we assume quality is fixed, we can use inference-time scaling and some smart techniques to significantly decrease latency and increase throughput while maintaining the same quality.
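As a minimal sketch of the re-querying loop itself, not the speaker's exact setup, and with `query_model` as a hypothetical stand-in for whatever inference client you use, one simple version prompts the model to reconsider its previous answer and then takes a majority vote:

```python
# Minimal inference-time scaling sketch: re-query a small model several times
# and aggregate the answers. `query_model` is a hypothetical stand-in for
# your inference client (OpenAI-compatible endpoint, Dynamo frontend, etc.).
from collections import Counter

def query_model(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("plug in your inference client here")

def scale_at_inference(question: str, n_samples: int = 4) -> str:
    """Ask the same question n times, nudging the model to reconsider,
    then take the most common final answer (simple majority vote)."""
    answers = []
    for i in range(n_samples):
        prompt = question if i == 0 else (
            f"{question}\n\nYou previously answered: {answers[-1]}\n"
            "Reconsider your reasoning step by step, then give a final answer."
        )
        answers.append(query_model(prompt))
    return Counter(answers).most_common(1)[0][0]
```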
One last thing, and I'm going to go through this really quickly because I'm running low on time, is manipulating K and V values. We've talked about how there's work we do in prefill that we don't want to lose, and we do routing to make sure we don't lose that KV. Now, if we have a workflow where we know the runtimes of things, for example a tool call that takes a moderately deterministic amount of time, we end up with KV eviction: if the tool call takes 30 seconds, the KV is going to get swept out of HBM, and you won't be able to reuse it later because it's no longer cached. But if we know it's going to be used again, why not just offload it? Inference-time scaling gives you structure to manipulate your KV; tool calling gives you structure to manipulate your KV. So instead of doing another prefill the second time around, you might do prefill once, do the LLM call and decode once, move the KV to host memory, and then, at the time you expect the tool to complete, move it right back into GPU memory so it's ready for the next LLM call that will include the added context from the tool. So KV manipulation, again, increases your speed and decreases your cost while keeping quality constant.
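Here is a sketch of that timing pattern. The engine methods used below (`offload_kv_to_host`, `prefetch_kv_to_gpu`, `generate`) are hypothetical names standing in for whatever your runtime exposes; the point is when the offload and prefetch happen, not the specific API:

```python
# Sketch of KV offload around a tool call (hypothetical engine API).
import asyncio

async def agent_step(engine, request_id, tool, tool_args, expected_tool_seconds):
    # The KV for request_id is hot in HBM after the previous prefill + decode.
    # The tool will run long enough that this KV would normally be evicted,
    # so proactively move it to host memory instead of losing it.
    await engine.offload_kv_to_host(request_id)

    tool_task = asyncio.create_task(tool(**tool_args))

    # Prefetch the KV back to GPU shortly before the tool is expected to finish,
    # so the next LLM call reuses the cache instead of re-prefilling everything.
    lead_time = 2.0  # seconds of prefetch headroom (tunable assumption)
    await asyncio.sleep(max(0.0, expected_tool_seconds - lead_time))
    await engine.prefetch_kv_to_gpu(request_id)

    tool_output = await tool_task
    # Only the new tool output needs prefill; the prior context hits cached KV.
    return await engine.generate(request_id, new_text=tool_output)
```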
The last thing we have to talk about here is dynamism. Worker specialization is really important. As I said, since disaggregation has different characteristics at different input and output sequence lengths, you actually want a mix of aggregated and disaggregated workers based on where you are in the ISL/OSL histogram. At lower input sequence lengths and higher output sequence lengths, you might want aggregated workers with higher tensor parallelism. In the middle of the input sequence range, you may want disaggregated workers. And in the long-context regime, you may want disaggregated workers with context parallelism. Again, this differs model to model, and this is just an exemplary graph. But generally, if you specialize workers, you can again increase your speed and decrease your cost while keeping quality the same, because you're not touching what the model is executing; you're not touching the math it's doing.
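As a sketch of what that specialization decision might look like in code (the thresholds here are made up for illustration; in practice they come from profiling your model on your hardware):

```python
# Sketch of ISL/OSL-based worker specialization (illustrative thresholds).

def choose_worker_pool(isl: int, osl: int) -> str:
    """Pick a worker flavor for a request based on its expected shape."""
    if isl < 1_000 and osl > isl:
        return "aggregated-high-tp"          # prefill-light: decode dominates
    if isl < 32_000:
        return "disaggregated"               # mid-range prompts: split P and D
    return "disaggregated-context-parallel"  # long context: parallelize prefill

print(choose_worker_pool(isl=500, osl=4_000))    # aggregated-high-tp
print(choose_worker_pool(isl=8_000, osl=1_000))  # disaggregated
print(choose_worker_pool(isl=120_000, osl=800))  # disaggregated-context-parallel
```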
One last thing about dynamism: load balance is quite important. As I mentioned earlier, looking at the ratio of prefill and decode workers is really important for determining whether your disaggregated deployment is going to be successful. For example, if you initially build your configuration off a histogram, say app A and app B with particular input and output sequence lengths, you may end up in a scenario where a change in the user distribution causes significant issues for your deployment. If the input sequence length grows a bit more than the output sequence length, you create more demand for prefill workers relative to decode workers. So your balance will change over time, and this has been shown empirically by a wide variety of people who publish data. You actually have to auto-scale across these two types of instances in real time to account for changes in the usage distribution of your platform. So dynamic load balancing increases your speed and keeps your costs low, but mostly it's just essential to ensuring that disaggregation actually works to its maximum potential.
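Here is a sketch of what a rebalancing loop between the two pools could look like. This is my own illustration: the metric names, thresholds, and `convert_worker` call are assumptions, and a real autoscaler would also account for KV transfer, warmup cost, and parallel configuration:

```python
# Sketch of a prefill/decode rebalancing loop (illustrative; assumed cluster API).
import time

def rebalance_loop(cluster, interval_s: float = 30.0):
    while True:
        m = cluster.metrics()  # assumed to expose queue depths / utilization
        prefill_pressure = m.prefill_queue_tokens / max(m.prefill_capacity_tokens, 1)
        decode_pressure = m.decode_active_tokens / max(m.decode_capacity_tokens, 1)

        # If prompts got longer, prefill backs up first: shift a worker D -> P.
        if prefill_pressure > 1.2 * decode_pressure and m.decode_workers > 1:
            cluster.convert_worker("decode", "prefill")
        # If outputs got longer, decode saturates: shift a worker P -> D.
        elif decode_pressure > 1.2 * prefill_pressure and m.prefill_workers > 1:
            cluster.convert_worker("prefill", "decode")

        time.sleep(interval_s)
```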
Okay, last things. Here is the Dynamo repo: github.com/ai-dynamo. We also have a Dynamo meetup being hosted tomorrow, Thursday, from 5:00 to 8:00 PM here in San Francisco. Please come; we're going to be talking a lot more about how we actually implement these things at