Hello, everyone. Thank you so much for being here, for sticking around for this talk. I'm going to be talking about optimizing inference for voice models in production. I'm going to be talking mostly about the runtime component, but also just a little bit on the infrastructure side. Just a quick introduction.
I'm Philip from Base10. Base10 is a model inference platform. We run production workloads for a wide variety of AI-native startups and enterprises. I'm based here in SF. I actually just moved here. It's really awesome. My favorite part about being in SF is much better sports teams than I had in Chicago.
And one of my favorite voice models is Orpheus TTS, which we're going to be talking about a whole bunch today. Quick agenda. So we're going to talk about TTS model architecture. Like, what is a text-to-speech model, actually, when you look at its config on Hugging Face? What sort of performance metrics are we looking at?
What sort of optimization techniques can we do to make the model better? How do we measure whether or not we succeeded? And then finally, what can we do on the infrastructure and client code to not shoot ourselves in the foot after doing a ton of runtime work and then just adding all that latency back by not doing our client code correctly?
So, architecture. This is one of the things I've been learning this year, which has been pretty great to realize. It's made life a lot simpler at the runtime level. Now, this is wrong. The thing up here that I'm going to say is that, like, everything is an LLM.
That is wrong, but it's useful. There are kind of two types of models. There are autoregressive transformer models that are LLMs or very LLM-adjacent. You see this in embeddings. You see this in transcription with models like Whisper. TTS is another example. You also have the more diffusion-style image models, which are a very different optimization problem.
But something that's cool is that because TTS models are so architecturally similar to LLMs, or in many cases derived directly from LLMs, we can access the rich ecosystem of LLM tooling and use it to make TTS models better. So, the TTS model that we're going to be using as an example all day is Orpheus TTS.
We're using it for two reasons. Okay, three reasons. The two reasons are because it's open source and it's really good. And also, I think Elias and Amu and everyone at Canopy Labs is really awesome. So, that's the third reason we're talking about their model. But it's a Llama 3.2 3B backbone.
So, if you look at this, it's the literal config from Hugging Face copy-and-pasted onto the screen. It's a LlamaForCausalLM architecture. And so, because of that, we can do all of our normal Llama stuff to this model and make it faster. They did a couple things.
I mean, they did a bunch of things to make it work. But a couple things that are relevant here. There is a larger vocab size, because you need all the speech-specific tokens, like laughs and stuff. And then they also extended the context length with RoPE scaling. So, we've got to make sure everything we do supports that.
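If you want to look at that for yourself, here's a minimal sketch of pulling the config; the model ID is my best guess at the Hugging Face repo, and the comments just reflect the fields called out above, so double-check against the real config.json:

```python
from transformers import AutoConfig

# Model ID assumed; check the Canopy Labs org on Hugging Face for the exact repo name.
cfg = AutoConfig.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

print(cfg.architectures)  # expect ["LlamaForCausalLM"]: a plain Llama 3.2 3B backbone
print(cfg.vocab_size)     # enlarged vocab to make room for the speech-specific tokens
print(cfg.rope_scaling)   # the RoPE scaling config used to extend the context length
```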
So, performance metrics. Like, what do we want to actually do here? We think about LLM metrics a little bit here. We just look at them a little bit differently. So, in LLMs you talk about time to first token. Now we're talking about time to first byte or sometimes even time to first sentence.
We need a little bit more of a useful output from the model before we really start feeling good about our response time. We do think about tokens per second, although we're going to think about it differently, which I'll explain later. And we mostly think about throughput, which is, you know, how many requests are we able to serve at a given time.
So, from a goals perspective, if you ask me, like, hey, Philip, how do you want to optimize Llama in general? I'll say, well, we want a lot of TPS. We want 100, we want 500, we want 1,000 tokens per second. We want as many tokens per second as we can get.
With voice models, you actually don't necessarily need that. In many cases, you only want as many tokens per second as you need for a real-time stream. For Orpheus, that's about 83 tokens per second, which for a 3-billion-parameter LLM is nothing. But what we actually want to do instead is, once we hit that mark, start optimizing for time to first byte so that our latency is really good, and start optimizing for concurrency so that we can get more connections and spend less on GPUs.
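As a back-of-envelope check, here's that idea in a few lines of Python; the 83 tokens per second comes from the talk, and the safety margin is just an illustrative assumption:

```python
REALTIME_TPS = 83  # approximate generation rate Orpheus needs to sustain real-time audio

def keeps_up_with_playback(measured_tps: float, headroom: float = 1.1) -> bool:
    # Past this threshold, extra TPS stops mattering; TTFB and concurrency matter instead.
    return measured_tps >= REALTIME_TPS * headroom

print(keeps_up_with_playback(120.0))  # True: spend further effort on latency and concurrency
```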
So, our goal in general: if all of these very nice and definitely not AI-generated people are all the different voices that our model is capable of creating, all the voice agents that we're running, how can we make all of these people fit on one, or even less than one, GPU?
That's the goal. So, how do we do it? Bunch of ways. So, first off, it's an LLM. If you are running an LLM with vLLM, for example, you can generally, in many cases, get better performance with TensorRT-LLM. TensorRT-LLM is something that we've been using at Base10 a lot.
I like to joke that I'm the unofficial marketing department for TensorRT-LLM because of how much I talk about it. But it really is fast. It can be a little bit complicated from a developer experience perspective to get up and running with it. But once you are up and running, it works really well.
We can also just quantize the model. Even though it's small, you can always make it faster by making it smaller. With the Hopper architecture, we quantize this model to FP8 pretty successfully. I know quantizing really small models like this can usually lead to quality degradation. But for this model, it's working pretty well in FP8, even when we quantize the KV cache.
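For intuition only, here's the core idea of FP8 per-tensor quantization in plain PyTorch; this is not what TensorRT-LLM does internally (it calibrates scales and fuses kernels), just a sketch of why casting to 8-bit floats shrinks the model and what accuracy you trade for it:

```python
import torch

def quantize_fp8(weight: torch.Tensor):
    # Per-tensor scale so the largest weight maps near FP8 e4m3's max representable value.
    scale = weight.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)  # half the bytes of FP16
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, s = quantize_fp8(w)
print((w - dequantize_fp8(w_fp8, s)).abs().max())  # small but nonzero: the accuracy trade-off
```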
And then a lot of the other runtime stuff is actually more audio-specific than it is LLM-specific. So one of the big challenges that we don't have with LLMs, which are just passing nice convenient bits of text back and forth, is that you have your audio, you have your audio codec, you have your decoding, all that kind of stuff.
So we use SNAC, which I was very disappointed to learn is not an actual tasty snack, but an audio decoder. And we actually use torch.compile. With torch.compile, you might be used to running it on a model, compiling your model's forward pass to make your runtime faster. We're actually using the same kind of system, with torch.compile and with PyTorch inference mode, on the audio decoder, and running that on the GPU.
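Here's a minimal sketch of that pattern, assuming the open-source `snac` package and its 24 kHz checkpoint; the checkpoint name and decode signature are assumptions, so adapt this to whatever decoder your pipeline actually uses:

```python
import torch
from snac import SNAC  # open-source SNAC codec; import path assumed

# Load the decoder once, keep it on the GPU, and compile the decode path specifically.
decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
decode_compiled = torch.compile(decoder.decode)

def decode_audio(codes: list[torch.Tensor]) -> torch.Tensor:
    # `codes` are the hierarchical audio code tensors assembled from the LLM's output tokens.
    with torch.inference_mode():  # no autograd bookkeeping on the hot path
        return decode_compiled(codes)
```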
We make sure that all the token-level batching works well throughout the entire pipeline, and we support multiple streaming protocols. Yeah, so these are the engine settings that you would need. You've got the quantization type of FP8 with an FP8 KV cache, and the FP8 context FMHA, in order to support the Hopper architecture and the quantization there.
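The slide itself isn't reproduced here, but the settings named above map to something like the following; the key names are illustrative and vary between TensorRT-LLM versions, so treat this as a sketch rather than a copy-paste config:

```python
# Illustrative only: the FP8 settings called out above, expressed as a config dict.
engine_settings = {
    "quantization": {
        "quant_algo": "FP8",           # FP8 weights/activations on Hopper
        "kv_cache_quant_algo": "FP8",  # quantize the KV cache as well
    },
    "use_fp8_context_fmha": True,      # FP8 fused attention for the prefill/context phase
}
```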
And here's a quick code sample of the audio decoding; I got a little ahead of my slides, I guess. So, we are basically batching. Usually, we would talk about continuous batching when we're talking about LLM optimization. We want to package all those tokens together.
In this case, we are doing dynamic batching. So, we're trying to pack as much into a batch as we can, but every 15 milliseconds, we're going to shoot it out. You've got that timeout set up here. If you want to trade a little bit of latency for more throughput, you can make that batch bigger.
So, yeah, we don't have token level continuous batching yet here, but we do have dynamic batching, which is going to get you pretty close. And because of this, actually, something that I was surprised about when we profiled this is that our TTS implementation with Orpheus is actually, in many cases, CPU bound.
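The actual code lived on the slide, but the shape of the idea is roughly this: a minimal asyncio sketch with the 15 ms flush window from the talk, and a hypothetical `decode_batch` callback standing in for the real decoding step:

```python
import asyncio

BATCH_TIMEOUT_S = 0.015  # flush every 15 ms even if the batch isn't full
MAX_BATCH_SIZE = 64      # raise this to trade a little latency for more throughput

async def batching_loop(queue: asyncio.Queue, decode_batch):
    # Dynamic batching: pack whatever audio tokens arrived within the window, then ship them.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]              # block until at least one item arrives
        deadline = loop.time() + BATCH_TIMEOUT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        await decode_batch(batch)                # hand the packed batch to the decoder
```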
Which is kind of where you want to be. You can throw more CPUs at a workload pretty efficiently. Even though the next-token prediction and the audio decoding are both on the GPU, both of those loops hit the CPU at different points. And that can actually be the bottleneck in the number of simultaneous streams that we're able to create.
So, how'd we do? I just showed you a lot of code and talked through it really quickly without really getting into depth. That could all just be smoke and mirrors. Let's see if it's actually any faster. So, again, the number one thing is going to be simultaneous streams, because you want to be cost efficient and use fewer GPU resources to serve a larger amount of traffic.
And in this case, a base implementation. I don't necessarily want to call anyone out, because there are a lot of really good ways to run this model, and you can get really good performance with vLLM. But this is just kind of the off-the-shelf, take-it-and-run-it, completely standard implementation.
So, with variable traffic, we're able to support 16 simultaneous streams, and with constant traffic, 24 simultaneous streams, on an H100 MIG. So, this is actually half an H100 GPU. It's a SKU that we use a lot because it's really good for these small models, where you want the Hopper architecture uplift and the TensorRT-LLM FP8 support.
But you don't want to pay for an entire 80-gigabyte GPU for just a 3-billion-parameter model. So, you know, we're seeing much better concurrency. And if you price that out with our list prices, you can get to a few cents per hour of conversation, which, if you have the volume for it, is going to be substantially better than paying for a per-token API.
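To make that arithmetic concrete, here's the back-of-envelope math involved; the GPU price below is purely hypothetical, so plug in whatever your provider actually charges for the MIG slice:

```python
# Hypothetical numbers: only the 24-stream figure comes from the benchmark above.
mig_cost_per_hour = 1.60      # assumed $/hr for half an H100 (check real list prices)
concurrent_streams = 24       # constant-traffic streams measured on that MIG

cost_per_conversation_hour = mig_cost_per_hour / concurrent_streams
print(f"~${cost_per_conversation_hour:.3f} per hour of conversation")  # pennies per hour with these assumptions
```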
But, okay, sure, maybe it's cheap at scale, but is it fast? Yes, it's fast. So, with the TensorRT-LLM implementation on the MIGs and on the H100s, we can actually get all the way down to a 150-millisecond time to first byte in real-world testing that we've done.
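If you want to sanity-check a number like that yourself, a rough client-side measurement looks something like this; the endpoint URL and payload shape are placeholders, not a real API:

```python
import time
import requests

def time_to_first_byte(url: str, payload: dict) -> float:
    # Stream the response and stop the clock at the first non-empty audio chunk.
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                return time.perf_counter() - start
    return float("inf")

# Example (placeholder endpoint and payload):
# print(time_to_first_byte("https://example.com/v1/tts", {"prompt": "Hello there!"}))
```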
Now, that doesn't mean your whole pipeline is that fast; we'll talk about that in a minute. That's just one part of the pipeline, but it's important, because you definitely don't want to be spending a lot of time waiting around for that first token. So, to transition into that discussion of what can go wrong here: you have this graph, and you have this nice config that I had up here.
And you're, like, all right, cool, I'm going to take this, I'm going to put it in production, and I'm going to see the results that he put up on screen. And it's going to work great. And the answer is no, it's not. It's a little bit harder than that.
So, the thing is, non-runtime factors, especially with these small models and with these multimodal systems, can actually be way more important than your runtime. And that's your infrastructure and your client code. Because, you know, I showed here, all right, maybe I cut the runtime in half from the base implementation.
I saved a couple hundred milliseconds. It's very easy to add those couple hundred milliseconds back, and well beyond that, by sending my query to New York instead of California, or by having to establish a session every time I run my client code. So, a few pitfalls to avoid.
Number one: if you go into, say, our model library, where we're just trying to get you started very quickly with a basic inference sample, it's basically going to be, hey, use requests, make a stream, stream it to your local computer, and start playing it with FFmpeg or something.
The issue is that here, the requests are going to be sent sequentially, and you need to create a new session every time. That takes time. So, if you're using this in production, you want a code sample (by the way, this is all up on our GitHub) that looks a lot more like a benchmarking script, where you're using a multiprocess pool.
You're sharing the session between all of these different requests, and you're actually sending traffic with the concurrency that allows you to saturate the deployment with multiple concurrent requests.
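As a rough sketch of that client-side pattern (the endpoint URL, payload shape, and worker count are placeholders, not our actual benchmarking script):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

TTS_URL = "https://example.com/v1/tts"                # placeholder endpoint
prompts = ["Hello! How can I help you today?"] * 64   # placeholder workload

# One shared session: reuse TCP/TLS connections instead of re-handshaking per request.
session = requests.Session()

def stream_one(text: str) -> None:
    with session.post(TTS_URL, json={"prompt": text}, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=4096):
            pass  # in a real client, hand each audio chunk to your player or pipeline

# Enough concurrent requests to actually exercise the deployment's concurrency.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(stream_one, prompts))
```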
Now, both of the code samples on the slides sit on top of HTTP and HTTP streaming. In many cases, if you're implementing voice pipelines, you're going to use something like LiveKit or Pipecat, and you're potentially going to be using a different protocol, something like WebSockets or gRPC, which we do have support for. And finally, I wanted to leave you with the thought that these models are only one part of a voice agent pipeline. We could spend a lot more than 15 minutes talking about the detailed implementation mechanics of making your voice model faster; we haven't even touched on stuff like fine-tuning the model, custom voices, zero-shot voice cloning, or being able to remove static and popping at the end of messages.
There's a lot of work to do just on the voice part, but it really is only one-third of the problem. When I think about voice agents, I think about three parts: listening, thinking, talking. And the most important thing here is, again, that while you can have great runtimes, the infrastructure to connect these three together is really what's going to determine your latency.
Being able to go from one model to the next one running in the same data center, with minimal network overhead in between the two. Even things as simple as not having to go off and do a hairpin at the DNS level and come back.
If that saves you 10 milliseconds on every step, and your voice pipeline has this plus a chunking algorithm and an interruption model, so you end up having four or five steps, well, then hairpinning alone is costing you 40 or 50 milliseconds. And that can be 10% of your SLA for a voice model.
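Spelled out as arithmetic, with a hypothetical end-to-end latency target (the 10 ms per hop and the four-to-five steps come from the talk; the 500 ms SLA is an assumption chosen so the math lands on that 10% figure):

```python
per_hop_overhead_ms = 10   # e.g., one avoidable DNS hairpin per step
steps = 5                  # transcription, chunking, LLM, interruption handling, TTS, ...
sla_ms = 500               # hypothetical end-to-end latency budget for the voice agent

overhead_ms = per_hop_overhead_ms * steps
print(f"{overhead_ms} ms of pure network overhead = {overhead_ms / sla_ms:.0%} of the SLA")
```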
So, yeah, that's my main point here: as much fun as it is to talk about the runtime stuff, and as much work as we do there, the infrastructure and the client implementation are equally important, if not more so. Anyway, so, yeah, that's the overview. Thank you all for coming through.
I have -- we're doing an event next week at Fogo de Chao, which is going to be pretty fun. I'm going to be talking in more detail about building some systems with open source models, and there's also going to be a lot of steak. So, definitely come on through if you're interested, and I'm on Twitter.
I'm on LinkedIn. So, it's Base10. Hit me up if you have any questions about this or anything else model performance related. Thank you so much, and I'll let you go eight seconds early. Thank you.