Hey there, everyone. I'm Kyle Kranen, and today I'll be talking about how to break the inference Pareto frontier to your advantage. Really, what enables success here is pairing a good model with a good system, one that takes into account the actual constraints of your deployment. That combination is key to the success of both the deployment and the application backed by it.
So, who am I and why am I talking about this? As I said, my name is Kyle Kranen, and I currently work at NVIDIA. Previously at NVIDIA, I led and GM'd the company's largest inference deployment, with a quarterly cloud bill in the multiple tens of millions of dollars. Now I'm an architect and lead for a project we just released in open source called NVIDIA Dynamo, which aims to enable data-center-scale inference: to manipulate your deployment and manipulate the Pareto frontier so you can hit better SLAs, or lower the cost of existing SLAs, with techniques like disaggregation and others we'll cover later in the talk.
The Dynamo meetup is linked right here; you can learn more about Dynamo there if you want to look it up, and I'll show it again at the end of the talk. So, the three things I think about when deciding whether something can actually be deployed and used are really simple.
Quality: whether your application and the system around your model can complete tasks with some required level of accuracy or quality. Latency: whether the task can be completed in a fast enough envelope, either for the user to be happy or to meet safety guarantees, as in robotics.
And cost: can the LLM complete the task cheaply enough per request for you to meet whatever margin requirements your application has? One of the ways we generally compare these three is with a Pareto frontier. Now, the frontier I'm showing here is two-dimensional.
It's really hard to plot things in 3D on a 2D slide, so I'm just going to show two dimensions. What this looks like is an edge that represents the best, up-and-to-the-right-most points we can achieve for a specific set of attributes.
In this case, we have TPS per GPU, which is effectively a cost metric: how many tokens can you serve per GPU per second? And user TPS, which is a responsiveness metric. So this is latency versus cost. And for a given application, you really only want one point on the Pareto frontier.
That is: what is your operating latency, what quality do you need, and how do you minimize cost at that point? This really depends on the application you're talking about, so one of the most important things to do when you're thinking about breaking the Pareto frontier is to think hard about your application.
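To make that operating-point idea concrete, here's a minimal sketch in Python. Everything in it, the config names, the numbers, and the SLA thresholds, is made up for illustration; in practice these would come from benchmarking your own deployment.

```python
# Minimal sketch: pick the cheapest configuration that meets the app's
# latency and quality targets. All numbers and config names are made up.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    user_tps: float      # responsiveness: tokens/sec seen by one user
    tps_per_gpu: float   # cost proxy: total tokens/sec per GPU
    quality: float       # task accuracy on your own eval set

candidates = [
    Config("tp8_agg",    user_tps=35, tps_per_gpu=220, quality=0.81),
    Config("tp4_disagg", user_tps=55, tps_per_gpu=310, quality=0.81),
    Config("tp2_fp8",    user_tps=90, tps_per_gpu=260, quality=0.79),
]

def pick_operating_point(configs, min_user_tps, min_quality):
    """Return the config with the best cost (highest tokens/sec/GPU)
    among those that satisfy the latency and quality constraints."""
    feasible = [c for c in configs
                if c.user_tps >= min_user_tps and c.quality >= min_quality]
    if not feasible:
        raise ValueError("no configuration meets the SLA")
    return max(feasible, key=lambda c: c.tps_per_gpu)

print(pick_operating_point(candidates, min_user_tps=50, min_quality=0.80).name)
```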
For example, if we're talking about personalized cancer cures, a topic that comes up a lot in the context of generative AI, latency and cost are pretty much no object. You could spend millions of dollars proving out a single cure, and if it works, the return on investment is so high that it doesn't really matter.
To take a different example, tab completion like you see in popular IDEs, like Cursor, is very, very dependent on snappiness. The user expects that when they press tab, they'll see a recommendation for the next line or the next set of tokens very quickly.
And to take another code example, async code commits, things like Cursor's agent mode and other applications where the agent works alongside the user: there's not as much concern for latency, but there is a concern for both quality and cost.
This really comes down to what the user expects from the application. Do they expect it to be fast? Do they expect it to be slow? Are they in the loop with it? Now, there's a set of pretty commonly known techniques that all support manipulating this frontier.
For example, quantization speeds up your latency and also decreases your cost, because you can run higher batch sizes. Retrieval-augmented generation generally slows down your application, raising latency and cost, but also increases quality. And reasoning is similar: you produce more tokens to think.
And changing the model config lets you move along any of these axes. If you change how the model is parallelized, you can significantly change its speed and cost characteristics, and theoretically its quality if you're talking about non-haloed context parallelism. The thing I want to impart before we jump into the more advanced techniques is that these techniques can be compounded.
So, for example, if you have an initial application with some required performance, you can stack retrieval-augmented generation on top of it to increase quality at the expense of latency, and then stack quantization on top of that to win the latency back.
The point I'm trying to make is that you have a large toolbox of techniques you can use together. The tools are not independent, and they can be combined in sometimes non-obvious ways to break your Pareto frontier, or squeeze it in different directions, to support your application.
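As a toy illustration of that compounding, here's a short Python sketch that treats each technique as a multiplier on latency, cost, and quality and composes them. The multipliers are invented for illustration, not measurements.

```python
# Toy sketch: treat each technique as a multiplier on (latency, cost, quality)
# and compose them. The numbers are illustrative, not measured.
baseline = {"latency": 1.0, "cost": 1.0, "quality": 1.0}

techniques = {
    "rag":          {"latency": 1.4, "cost": 1.3, "quality": 1.15},  # slower, pricier, better answers
    "quantization": {"latency": 0.7, "cost": 0.6, "quality": 0.99},  # faster, cheaper, ~same quality
}

def stack(base, *names):
    out = dict(base)
    for name in names:
        for metric, factor in techniques[name].items():
            out[metric] *= factor
    return out

# RAG alone raises quality but hurts latency; adding quantization claws
# the latency back while keeping most of the quality gain.
print(stack(baseline, "rag"))
print(stack(baseline, "rag", "quantization"))
```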
Beyond those techniques, there are three things that I think drive how you can modify the Pareto frontier going forward: scale, structure, and dynamism. One of the most relevant things in the realm of scale is disaggregation.
For those who aren't aware, KV caching is a technique where you take the key and value vectors associated with each token and cache them, so that during autoregressive generation you don't have to recompute the key and value vectors for the entire sequence up to that point.
You just compute the new ones and append them to the KV cache. What this means is that we effectively have two phases of generation: one in which you do the prefill, filling up your KV cache, and one in which you generate new KV cache entries along with new tokens and produce output.
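Here's a toy Python sketch of that split. The single-head "attention" here is a stand-in, not a real model; the point is only that prefill computes and caches K/V for the whole prompt once, while each decode step adds a single K/V pair and reads the cache.

```python
# Toy sketch of why KV caching creates two phases of generation.
import numpy as np

d = 8
W_k, W_v, W_q = (np.random.randn(d, d) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ V

def prefill(prompt_embs, cache):
    # Phase 1: compute K/V for every prompt token once and cache them.
    for x in prompt_embs:
        cache["K"].append(W_k @ x); cache["V"].append(W_v @ x)

def decode_step(x, cache):
    # Phase 2: each new token adds only one K/V pair; attention reads the cache.
    cache["K"].append(W_k @ x); cache["V"].append(W_v @ x)
    return attend(W_q @ x, np.stack(cache["K"]), np.stack(cache["V"]))

cache = {"K": [], "V": []}
prefill([np.random.randn(d) for _ in range(16)], cache)   # compute-bound burst
out = decode_step(np.random.randn(d), cache)              # memory-bound, one token at a time
print(out.shape, len(cache["K"]))
```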
Disaggregation as a technique takes these two phases, which typically run on the same set of GPUs, and splits them across different workers and sets of GPUs. This provides a couple of key benefits that we'll go into right now. The biggest is that you can now treat two phases with very different needs separately.
Prefill is very compute bound, and decode, depending on the application and model, can be very memory bound. Disaggregation lets you do granular load matching between the two phases. To use DeepSeek as an example, compute saturates relatively early, so you might use relatively few GPUs for your prefill instances and have them handle a lower batch size,
while running a much larger batch size on many more GPUs for your decode instances. That split, and the heterogeneity between the two, lets you get far more performance. The other benefit concerns scheduling: with in-flight batching you have many tokens arriving at the same time that are in different phases of generation, so if a request doing prefill and a request doing decode land on the same machine, you get scheduling conflicts, and the scheduler has to decide whether to admit the new tokens or keep the decodes moving.
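To make that conflict concrete, here's a minimal Python sketch of the decision an aggregated worker faces every iteration. The request objects, token budget, and queues are invented for illustration, not any real scheduler.

```python
# Minimal sketch of the prefill/decode scheduling conflict on one aggregated
# worker. Requests and budgets are invented for illustration.
from collections import deque

token_budget_per_step = 512              # tokens that fit in one forward pass
decode_batch = ["req-a", "req-b"]        # in-flight requests, one new token each
prefill_queue = deque([("req-c", 400)])  # (request, prompt tokens still to prefill)

def schedule_step():
    # Decode requests are latency-sensitive: starve them and user TPS drops.
    budget = token_budget_per_step - len(decode_batch)
    admitted = []
    # Any budget left over goes to prefill; if we admit too much, this step
    # gets slower and every decode in the batch stalls behind it.
    while prefill_queue and prefill_queue[0][1] <= budget:
        req, tokens = prefill_queue.popleft()
        budget -= tokens
        admitted.append(req)
    return admitted

print(schedule_step())   # whether req-c gets in depends on what decode needs
```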
There are techniques to handle this conflict, like chunked prefill and piggybacking, but there is a cost to that mutual scheduling, so splitting the phases apart makes the scheduling simpler. Now, there's an asterisk to this, which I'll get to, but first let me quickly go over the performance numbers.
I'll use Llama 70B as an example. On the Y axis we have tokens per second per GPU; on the X axis, tokens per second per user. Up and to the right is better. If we choose one operating point for latency, then disaggregating on the same number of GPUs, 16 H100s total, achieves up to two times the tokens per second per GPU at that fixed latency, which means you're paying half as much for your application.
There are some constraints, though. The use case really does dictate how much disaggregation helps. For example, low input length use cases see little to no speedup, because you don't have as much of a scheduling problem: they're very prefill-light, so you're basically just doing decode the entire time.
And per the graph, disaggregation is most useful in the middle. In very high-latency, high-throughput scenarios (the top left of the graph) and in low-latency, low-throughput scenarios (the bottom right), aggregated serving tends to reconverge with disaggregated and can even produce a little more performance.
That said, for a lot of interactive applications, disaggregation makes the most sense, because users tend to care about responsiveness in the realm of 20 to 200 tokens per second. The other caveat is that configuration is really important.
Since you're separating prefill and generation, the balance between the number of prefill workers and decode workers dictates performance. If you have too many decode workers, they'll starve for work. If you have too many prefill workers, they'll generate work faster than the decode workers can absorb it, and the decode workers get crushed under ever-increasing load and queue depth.
And modifying this balance is expensive and hard, because the right prefill-to-decode split depends on the parallel config of each side, so it's a really wide configuration space. The other thing we talk about with respect to scale is routing. We've established that the KV cache is important.
One of the things we have to do for prefill/decode disaggregation is transfer the KV between machines. And in some sense there's an affinity for certain machines to do certain work, since the KV cache from previous requests is already stored on those GPUs, or offloaded to host memory or external storage, during the course of inference.
In the naive case, you route essentially at random: you're not biasing toward any worker, just sampling uniformly. Alternatively, with a purely KV-based router you optimize only for the KV prefix match, and you can end up biasing toward machines that already carry too much KV load, can't actually take the request, and leave you queueing. In the smart case, you optimize a cost function over both: maximize the prefix match against work that's already been done on a node while accounting for the load that already exists on that node.
And as you scale out, as you get more and more GPUs in a deployment, you end up with more and more of the KV space represented locally on those machines. So a larger deployment gives you an asymptotically increasing KV cache hit rate, which means you're doing less and less prefill work over time.
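Here's a minimal Python sketch of that kind of scoring, combining prefix-match reuse with a load penalty. The worker state, block granularity, and weight are assumptions for illustration, not Dynamo's actual router.

```python
# Minimal sketch of a KV-aware router: score each worker by how much of the
# request's prefix it already has cached, minus a penalty for current load.
def prefix_match_blocks(request_blocks, cached_blocks):
    """Count leading KV blocks of the request already resident on the worker."""
    n = 0
    for req, cached in zip(request_blocks, cached_blocks):
        if req != cached:
            break
        n += 1
    return n

def pick_worker(request_blocks, workers, load_weight=2.0):
    def score(w):
        hit = prefix_match_blocks(request_blocks, w["cached_blocks"])
        return hit - load_weight * w["active_requests"]   # reuse vs. queueing
    return max(workers, key=score)

workers = [
    {"name": "w0", "cached_blocks": ["sys", "docA", "q1"], "active_requests": 9},
    {"name": "w1", "cached_blocks": ["sys"],               "active_requests": 1},
]
# Pure KV matching would pick w0; the load term flips the decision to w1.
print(pick_worker(["sys", "docA", "q2"], workers)["name"])
```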
So, to give routing a report card: it improves your speed and your cost, and it doesn't really affect quality, because the model is doing the same work it would normally do. Now, let's talk about structure. Structure is really important because of the workloads you've probably seen here at the AI Engineer World's Fair, like agents.
Agents impose structure on the workload in that they have moderately predictable usage patterns across concurrent requests. An example here is inference-time scaling. This is a cool graph I'll go over quickly. We have three models: in green, an 8B model; in yellow, a 49B model;
and in red, a 235B model. We find that with inference-time scaling, that is, re-querying the model and prompting it to reconsider its results or reason more about them, we can produce better and better results. And you see a really interesting trend: with about three or four rounds of re-querying, the 8B model is basically on par in quality with the 49B model,
and the 49B is almost on par in quality with the 235B model. And note that the cost of querying that 8B model, even querying it multiple times, is lower than querying the larger model. In this sense, inference-time scaling can be seen as increasing quality at the cost of speed and at the cost of cost,
because you're re-querying the model. But alternatively, if you hold quality fixed, you can get lower latency and lower cost by using a smaller model and re-querying it multiple times. And the structure we get from that re-querying lets us do better scheduling. In this graph, we have a series of curves representing the runtime of a reasoning example from the Natural Plan dataset
against concurrency, that is, how many concurrent instances you can run at once. Sampling across that concurrency, we see that adding disaggregation gives a small benefit, mostly because this dataset is short-ISL, long-OSL,
so you don't get a whole lot of benefit from disaggregation there. What really helps is removing a round trip by having the re-queries come from the router instead of from the user or client on the outside, which significantly cuts the round-trip latency.
And on top of that, making the router and the LLM scheduler aware that you're doing repeat work, that you're re-querying, gives you the additional benefit from the red line to the green line. That is to say: across a wide variety of models, if we hold quality fixed, we can use inference-time scaling and some smart techniques to significantly decrease latency and increase throughput while maintaining the same quality.
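Here's a hedged Python sketch of that pattern: re-query a small model several times while keeping a session identifier, so a router like the one described above can keep the growing prefix on the same worker. The `generate` client and session handling are hypothetical stand-ins, not a real API.

```python
# Hedged sketch of inference-time scaling done close to the router: re-query
# the same small model N times, keeping a session id so the router can keep
# the growing prefix on one worker. `generate` is a hypothetical client.
def refine(generate, prompt, rounds=3, session_id="session-123"):
    transcript = prompt
    answer = generate(transcript, session=session_id)
    for _ in range(rounds):
        # Each round appends to the same prefix instead of starting a fresh
        # request from the outside, so prior KV work stays reusable.
        transcript += f"\n\nDraft answer:\n{answer}\n\nReconsider and improve it."
        answer = generate(transcript, session=session_id)
    return answer

# Stub generator so the sketch runs without a real deployment.
def fake_generate(prompt, session=None):
    return f"(answer after {prompt.count('Reconsider')} refinement rounds)"

print(refine(fake_generate, "Plan a 3-city trip under $2000."))
```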
One last thing, and I'll go through this quickly because I'm low on time: manipulating the K and V values themselves. We've talked about how there's work we do in prefill that we don't want to lose, and we do routing to make sure we don't lose that KV.
Now, if we have a workflow where we know the runtimes of things, for example a tool call that takes a moderately deterministic amount of time, we end up with KV eviction:
if the tool call takes 30 seconds, the KV gets swept out of HBM, and you can't use it later because it's no longer cached. But if we know it's going to be used again, why not just offload it?
Basically, inference-time scaling gives you structure you can use to manipulate your KV, and tool calling does too. So instead of doing another prefill the second time, you might do the prefill once, do the LLM call and the decode once, move the KV to host memory, and then, around the time you expect the tool to complete, move it right back into GPU memory so it's ready for the next LLM call, which will include the added context from the tool.
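Here's a minimal async Python sketch of that idea, assuming a hypothetical KV store with offload and prefetch operations. None of these calls are a real runtime API; they just show the timing pattern.

```python
# Hedged sketch of KV manipulation around a tool call: offload the session's
# KV to host memory while the tool runs, and prefetch it back to GPU just
# before the tool is expected to finish. All objects here are stand-ins.
import asyncio

async def call_tool_with_kv_offload(session, run_tool, kv_store,
                                    expected_tool_seconds=30.0,
                                    reload_seconds=2.0):
    # Free HBM for other requests instead of letting the cache get evicted.
    await kv_store.offload_to_host(session)

    tool_task = asyncio.create_task(run_tool())

    # Start moving the KV back shortly before the tool should complete,
    # so the next LLM call doesn't pay a re-prefill or a cold reload.
    await asyncio.sleep(max(0.0, expected_tool_seconds - reload_seconds))
    await kv_store.prefetch_to_gpu(session)

    return await tool_task

class FakeKVStore:
    async def offload_to_host(self, session): print(f"{session}: KV -> host")
    async def prefetch_to_gpu(self, session): print(f"{session}: KV -> GPU")

async def fake_tool():
    await asyncio.sleep(0.05)
    return {"weather": "sunny"}

asyncio.run(call_tool_with_kv_offload("session-123", fake_tool, FakeKVStore(),
                                      expected_tool_seconds=0.05,
                                      reload_seconds=0.02))
```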
So, KV manipulation again increases your speed and decreases your cost while keeping quality constant. The last thing we have to talk about here is dynamism. Worker specialization is really important. As I said, since disaggregation behaves differently at different input and output sequence lengths, you actually want a mix of aggregated and disaggregated workers based on where you sit in the ISL/OSL histogram.
At lower input sequence lengths and higher output sequence lengths, you might want aggregated workers with higher tensor parallelism. In the middle of the input sequence range, you may want disaggregated workers. And in the long-context regime, you may want disaggregated workers with context parallelism.
Again, this differs model to model, and this is just an exemplary graph. But generally, if you specialize workers, you can increase your speed and decrease your cost while keeping quality the same, because you're not touching what the model executes; you're not touching the math it's doing.
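Here's a small Python sketch of that kind of pool selection. The thresholds and pool names are invented and would differ per model; in practice you'd derive them from profiling your own ISL/OSL histogram.

```python
# Hedged sketch of worker specialization: route by sequence lengths to pools
# with different configurations. Thresholds and pool names are invented.
POOLS = {
    "agg_tp8":   "aggregated, higher TP: short prompts, long generations",
    "disagg":    "prefill/decode disaggregated: mid-range prompts",
    "disagg_cp": "disaggregated + context parallel: very long prompts",
}

def pick_pool(input_len_tokens, output_len_hint_tokens):
    if input_len_tokens < 1_000 and output_len_hint_tokens > input_len_tokens:
        return "agg_tp8"
    if input_len_tokens < 32_000:
        return "disagg"
    return "disagg_cp"

for isl, osl in [(200, 2_000), (8_000, 500), (120_000, 1_000)]:
    pool = pick_pool(isl, osl)
    print(isl, osl, "->", pool, "|", POOLS[pool])
```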
One last thing about dynamism: load balance is quite important. As I mentioned earlier, the ratio of prefill to decode workers determines whether your disaggregated deployment is going to be successful. So if you build your initial configuration off a histogram, say app A and app B with their particular input and output sequence lengths, you can end up in a scenario where a change in the user distribution causes significant issues for your deployment.
In this case, when the input sequence lengths grow a bit, you create more demand for prefill workers relative to decode workers, so your balance changes over time. And this has been shown empirically by a wide variety of people who publish data on this.
You actually have to auto-scale across these two types of instances in real time to account for changes in how users are using your platform. So dynamic load balancing increases your speed and keeps your costs low, but mostly it's just essential to making sure disaggregation works to its maximum potential.
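Here's a minimal Python sketch of that rebalancing loop, driven by two simple signals: prefill queue depth and decode KV utilization. The signals, thresholds, and one-worker step size are assumptions for illustration, not how any particular autoscaler works.

```python
# Hedged sketch of dynamic prefill/decode rebalancing: watch simple signals
# and shift workers between the two pools. Signals and thresholds are invented.
def rebalance(prefill_workers, decode_workers,
              prefill_queue_depth, decode_kv_utilization,
              queue_high=50, kv_high=0.90):
    # Prefill is backed up: decode will soon starve for new sequences anyway,
    # so lend it a worker.
    if prefill_queue_depth > queue_high and decode_workers > 1:
        return prefill_workers + 1, decode_workers - 1
    # Decode is running out of KV room: generation slows for everyone,
    # so shift capacity the other way.
    if decode_kv_utilization > kv_high and prefill_workers > 1:
        return prefill_workers - 1, decode_workers + 1
    return prefill_workers, decode_workers

# ISL/OSL drift like the app A -> app B shift above changes which branch fires.
print(rebalance(4, 12, prefill_queue_depth=80, decode_kv_utilization=0.60))
print(rebalance(4, 12, prefill_queue_depth=10, decode_kv_utilization=0.95))
```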
Okay, last things. Here is the Dynamo repo: github.com/ai-dynamo. We also have a Dynamo meetup being hosted tomorrow, Thursday, from 5:00 to 8:00 PM here in San Francisco. Please come; we'll be talking a lot more about how we actually implement these things at the event.