Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber


Chapters

0:00 Introduction to Gabber and Real-Time AI
2:15 Gabber's Mission for Consumer AI
4:17 The Orpheus Voice Model
5:43 Challenges in Voice Cloning
7:44 Latency Management and "Head of Line Silence"
11:07 Infrastructure for Batch Inference
11:36 Leveraging vLLM and Dynamic Quantization
13:21 Load Balancing with a Consistent Hash Ring
14:17 System Architecture Overview
15:07 Conclusion and Open Source Shout-outs

Transcript

I'm Neil. That's Jack in the front there, but it's just me up here. We're really just going to talk about our experience hosting Orpheus inference for our real-time stack. I'm the CTO at a company called Gabber, a small startup, but I've spent a lot of my career doing real-time media, so sending audio and video around the internet.

I started at a company called Bebo, which was ultimately acquired by Amazon. There we built a game streaming app, kind of like OBS. I built a lot of the streaming infrastructure, plus an ML pipeline that would watch people play video games, so it would watch people play Fortnite and put cool effects on the screen when they got a kill or a victory or something.

So I spent a lot of time in the GStreamer trenches and with WebRTC and RTMP and all that stuff. I took a detour and worked at Uber for a couple of years, then left and did a multiplayer gaming startup with my brother Jack here, basically trying to bring AAA-style multiplayer to web games, with voice and everything.

So there was a lot of real-time media and real-time simulation work there. We didn't do a super good job with that company and shut it down. We were using LiveKit, and I made a LiveKit SDK, which segued into me working at LiveKit. I think a lot of people in this room have probably heard of LiveKit.

I spent the second half of my time at LiveKit building the LiveKit Agents platform, which was kind of born out of LiveKit's involvement with GPT voice. I wrote the first line of code on that and worked on it for a while. Then I left LiveKit and started another company with my brother: Gabber.

So that's what we're doing now. Gabber is infrastructure for real-time AI personas. We have some core building blocks like voice, memory, video inputs coming soon, tool calling, kind of the usual suspects. But our focus is really on consumer apps. We see the replacing-a-human use cases pretty often, like call centers, customer support, AI SDRs, that kind of stuff.

But our interest is really in the consumer space. We think these real-time, synchronous AI experiences are going to be as ubiquitous as websites and apps in the next two to five years. That's our focus, and that's the opinion we build into our product and our SDKs and APIs to differentiate.

Here are some of the use cases we're seeing. These are also kind of the usual suspects: AI girlfriends was the first one, and I'll get to why it was the first one. Others are AI NPCs, AI therapists, AI personal trainers, AI toys for kids.

I think you saw that one a couple of sessions ago. We're seeing a lot of different use cases, and I saw the same thing at LiveKit, which is what got me really excited about this stuff. But AI girlfriends was the first one mainly because everything is so expensive.

Some of these voice platforms are upwards of $5 an hour end-to-end, and that doesn't work for 90% of consumer apps. But AI girlfriends work because the users are paying, usually through a credit system: you buy credits and the app spends them as you use it.

So those users are more comfortable with that kind of spend, but most consumer use cases need something pretty close to free. At the time we were not hosting any voice models, but we knew we had to: the only way to execute on our vision of putting these experiences everywhere was to start bringing more things in-house and running them on our own GPUs.

At the time there weren't a lot of good open-source voice models. There were good ones for asynchronous use cases, so generating voice slower than real time, but there weren't any really good real-time streaming ones until Orpheus. Orpheus was the first really good one that was ready to go.

So Orpheus came out and we were like, okay, this is our time to shine. We immediately put it on an H100, hosted it, went viral with Jack's tweets, and got a ton of top-of-funnel. That was the starting point; there's a before Orpheus and an after Orpheus, and our company kind of changed.

So a little background on what Orpheus is. It's a voice model, but it started as a Llama 3B. It was pre-trained on around 100,000 hours of voice data, plus text data to make sure it kept its understanding of language. And it was trained to output audio tokens.

They're called SNAC tokens; SNAC is another open-source project, an audio codec. Orpheus is trained to output the 24-kilohertz version of SNAC tokens, and those tokens are then decoded into 24-kilohertz audio. The important thing to note here is that it's about 85 SNAC tokens for one second of audio.

So wherever you're hosting Orpheus, it has to sustain a throughput of at least 85, really more like 90 to 100 tokens per second, to keep up with real time. Otherwise you get gaps in the audio and it sounds bad. The other thing that was important to us, because we're going after consumer use cases, was cloning.
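
Just to make that throughput requirement concrete, here is the back-of-the-envelope math; the 85 and 100 tokens-per-second figures are the approximate numbers from the talk.

```python
# Rough real-time check: Orpheus needs ~85 SNAC tokens per second of 24 kHz audio,
# so the server must generate tokens faster than that per stream.
SNAC_TOKENS_PER_SECOND_OF_AUDIO = 85
SERVER_TOKENS_PER_SECOND = 100  # roughly what an L40S sustains here, per the talk

real_time_factor = SERVER_TOKENS_PER_SECOND / SNAC_TOKENS_PER_SECOND_OF_AUDIO
print(f"real-time factor: {real_time_factor:.2f}x")  # ~1.18x, barely faster than real time
```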

Our clones need to be emotive and high-fidelity, and one-shot cloning doesn't work that well. That's especially true for Orpheus because it only had 100,000 hours of pre-training data, whereas I think some of the zero-shot emergent behavior shows up at a million-plus hours. We're scrappy, as I think you can tell by our design here.

We're pretty scrappy, right? We weren't going to fill that data gap ourselves, so we went with low-rank fine-tunes for our clones. Here's an example of a low-rank fine-tune. We have better ones, but those are customers' voices, so I didn't want to put them in the deck.

I cloned Jack's voice yesterday using rank 16, alpha 32, on basically all the projections. Here's the source audio. So that's the source, and here's the result of the fine-tune. Let me manage expectations: this was pretty bad data, only about 10 minutes of it, and you really want 30 minutes or so.
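
For reference, a LoRA setup with those hyperparameters (rank 16, alpha 32, all the projections) might look roughly like this with the Hugging Face PEFT library. This is a hedged sketch rather than Gabber's actual training code, and the model ID and dropout value are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base Orpheus checkpoint (model ID assumed here for illustration).
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

lora_config = LoraConfig(
    r=16,                 # "rank 16"
    lora_alpha=32,        # "alpha 32"
    # "basically all the projections" on a Llama-style model:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,    # assumption, not mentioned in the talk
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```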

So I had to overfit: I trained for about five epochs. It's pretty overfit, but you'll hear it still sounds okay. "Hey, how are you? I'm kind of sick." This is a longer generation; let's see if it sounds okay. So it's not bad, but I've heard his voice most of my life (I'm the older brother), so I know it very well.

So it's jarring to me, but the cool thing is it's trained to do these tokens, which is important for consumer, and it's pretty emotive. When it said "I'm kind of sick," it sounded pretty sad, so it picks up on the language cues as well. The other thing that's really important, obviously, for all voice use cases, not just consumer, is latency.

There are four things that really affect latency: time to first token, tokens per second (I'll get into why that matters later), network latency, and the biggest cause we found, which is what we're calling "head-of-line silence."

This is somewhat specific to the Orpheus model, so it isn't going to be true for all models. Head-of-line silence is basically that somewhere in the Orpheus fine-tune, the data had a lot of silence at the beginning, because they hired voice actors, recorded scripts, and fine-tuned the model from those recordings.

This is the default Orpheus voice, or one of the ones that came with it, called Tara. It has about 600 milliseconds of silence at the beginning. They probably had good reasons for including that silence, but it's a lot, right? 600 milliseconds of silence.

We're running on L40S machines as of now, which can do about 100 tokens a second, so 600 milliseconds of silence is roughly 50 SNAC tokens, which takes about half a second to generate. We do filter the silence out, so we're not playing that audio back to the user, but because it still takes time to generate those tokens, we're adding basically half a second of latency on wasted compute.
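
Here is the same head-of-line-silence arithmetic written out; all numbers are the approximate figures from the talk.

```python
# Even though the leading silence is filtered out before playback, the tokens
# for it still have to be generated before any audible audio exists.
SILENCE_SECONDS = 0.6              # ~600 ms of silence baked into the voice
TOKENS_PER_SECOND_OF_AUDIO = 85    # SNAC tokens per second of audio
SERVER_TOKENS_PER_SECOND = 100     # approximate L40S throughput

silence_tokens = SILENCE_SECONDS * TOKENS_PER_SECOND_OF_AUDIO   # ~51 tokens
added_latency = silence_tokens / SERVER_TOKENS_PER_SECOND       # ~0.51 s of wasted compute
print(f"~{silence_tokens:.0f} silent tokens -> ~{added_latency:.2f}s extra latency")
```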

So even filtering out the silence, you're only saving maybe 10% there, because you're just barely faster than real time. Again, we're scrappy, so we're running on L40S. But what we found interesting is that we could actually just fine-tune the silence away. This is an example of a clone we did, a LoRA fine-tune of a customer's voice.

And the latency is basically 100 milliseconds at P50. So much better, half a second back basically for free. That matters because you have a latency budget in a real-time application. The way these work is the human talks, and at some point you decide: is the human done talking?

Those end-pointing models are not perfect, so you typically add a snooze period at the end. But during that snooze period you can still do work, so we kick off the LLM right away. The way we have our Orpheus stack set up, we start generating audio after two sentences of LLM output, or once the LLM is done, but typically two sentences, which gives Orpheus enough context to capture the emotion.
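
As a rough illustration of that two-sentence buffering, here is a minimal sketch; the sentence splitting is deliberately naive, and none of this is Gabber's actual pipeline code.

```python
import re

SENTENCE_END = re.compile(r"[.!?](\s|$)")

def tts_chunks(llm_text_stream, sentences_per_chunk=2):
    """Buffer streamed LLM text and hand a chunk to TTS after roughly
    `sentences_per_chunk` sentences, or whatever is left when the stream ends."""
    buffer, sentence_count = "", 0
    for piece in llm_text_stream:
        buffer += piece
        sentence_count += len(SENTENCE_END.findall(piece))
        if sentence_count >= sentences_per_chunk:
            yield buffer.strip()
            buffer, sentence_count = "", 0
    if buffer.strip():
        yield buffer.strip()

# Hypothetical usage: for chunk in tts_chunks(llm.stream(prompt)): orpheus.speak(chunk)
```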

All that to say: if we generate the first audio packet within that snooze period, then we're in the money on latency, within our latency budget. Now, these end-pointing models are going to get better, so that snooze period is going to shrink; half a second to a second is probably the sweet spot.

One and a half seconds is kind of the threshold, I think: anything above that sounds pretty bad, and anything at or below it is acceptable. So that half a second mattered a lot, because it gives our LLM more time to produce tokens.

And because we're letting customers bring their own LLMs, that part is somewhat out of our control. The next big category is infrastructure. Again, we're scrappy, so we needed something robust and not too complicated, and we needed batch inference, obviously, to save money.

We need to run multiple generations in the same batch on the same GPU concurrently, and we also needed multiple LoRAs running in the same batch on the same GPU. And we wanted one load balancer in front of everything, since we're spinning up different models for different languages.

We wanted all of this to be a black box that just worked. vLLM to the rescue: it supports all of those things. vLLM can do batch inference with LoRAs, which is really, really awesome. Unfortunately, the FP16 model was slower than real time on L40S.

It worked on H100s, but on L40S it was slower than real time. Again, vLLM to the rescue: it supports FP8 dynamic quantization, which requires basically zero work. It handles all the scaling automatically, so you don't have to bake calibration data into your own quant.

It just works, and it's amazing. That brought us up to 105 tokens a second on the non-fine-tuned voices and 95 tokens a second on the LoRA voices with a batch size of 10, which puts us well in the money in terms of margins, so that's nice.
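
A sketch of what that looks like with vLLM's offline API, combining FP8 online quantization with multi-LoRA batching. The model ID, LoRA name, and path are placeholders, and a real deployment would use vLLM's streaming/serving path rather than this offline example.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="canopylabs/orpheus-3b-0.1-ft",  # base Orpheus checkpoint (ID assumed)
    quantization="fp8",                    # online dynamic FP8, no calibration step
    enable_lora=True,                      # allow per-request LoRA adapters
    max_loras=8,                           # how many adapters can share a batch
    max_lora_rank=16,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)

# Each request can carry its own voice clone; vLLM batches them together.
outputs = llm.generate(
    ["<text to speak, formatted however Orpheus expects>"],
    params,
    lora_request=LoRARequest("jack-voice", 1, "/loras/jack-voice"),  # placeholder adapter
)
```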

Part of the infrastructure is, of course, load balancing. LoRAs are, depending on your hyperparameters, between 100 and 200 megabytes, so you want to make sure you end up on a server that already has the LoRA in memory. That's where sticky sessions come in, and it keeps latency low.

We also wanted to support streaming inputs, mainly because the LLM often isn't done by the time you want to start producing audio, and arbitrarily long generation, so storytelling, things like that. That's another reason this load balancing problem is interesting: you want to make sure you end up on the same GPU across the whole session.

So we went with a pretty much by-the-book consistent hash ring setup. If you've seen hash rings before, this is not that interesting, but basically you hash each server multiple times, these are called virtual nodes, so the servers distribute around the ring, and when a LoRA generation starts, you hash it with the same hashing algorithm, pick the nearest server on the ring, and it just works.

The reason we chose this is that you can remove a server and it doesn't rebalance everything; only a few migrations are needed. The other nice thing about this strategy is that if a clone gets very popular, it's pretty easy to handle that too.
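
A minimal sketch of that kind of consistent hash ring with virtual nodes (not Gabber's actual implementation; the hash function and virtual-node count here are arbitrary choices).

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring: each server is hashed many times ("virtual nodes"),
    and a key is routed to the nearest server clockwise on the ring."""

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            self.add(server, vnodes)

    def add(self, server, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        # Removing a server only migrates the keys that pointed at its vnodes.
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
print(ring.lookup("lora:jack-voice"))  # the same LoRA key lands on the same GPU
```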

For a popular clone, the more popular the LoRA is, the more servers you add it to, so you can scale it up and down elegantly without a ton of engineering work. At a high level it looks something like this: we have our WebRTC backend that terminates the client connections, then we use WebSockets to our GPUs, and the GPUs talk to Redis. Redis is not the best choice, but if we ever scale beyond needing Redis for this, we can solve that with piles of money, I guess.

The way it works is: you start a session, the WebRTC backend connects to any GPU, that GPU asks Redis, "Hey, which GPU is this request supposed to be on?" and then proxies it over another TCP connection to the correct GPU. That's fine because the GPUs are in the same data center on private networking, so TCP within the same network is low-latency enough.
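
A sketch of that Redis lookup, reusing the hash ring idea above. The key names, TTL, hostname, and helper function are hypothetical, and the actual proxying logic isn't shown in the talk.

```python
import redis  # redis-py client

r = redis.Redis(host="redis.internal", port=6379)  # hostname is a placeholder

def resolve_gpu(session_id: str, ring) -> str:
    """Return the GPU a session is pinned to, claiming one via the hash ring if it's new."""
    key = f"session_gpu:{session_id}"
    gpu = r.get(key)
    if gpu is None:
        candidate = ring.lookup(session_id)
        # NX: a concurrent request for the same session can't pick a different GPU.
        # EX: the mapping expires some time after the session ends.
        r.set(key, candidate, nx=True, ex=3600)
        gpu = r.get(key)
    return gpu.decode()
```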

That's pretty much it. The conclusion is: we're pretty scrappy, and we were able to host voice models on our own GPUs and handle that infrastructure, so you can too. The open source is there, and I think it's going to unlock a ton of cool use cases. Shout-out to Swix, he's a supporter of ours and obviously put this on, or half of it, I guess. Swix is awesome, we love him.

Shout-out to Canopy Labs, who created Orpheus; I haven't met them, but would love to if they're here. And to free and open source software in general: Orpheus is built on Llama and SNAC, so this whole ecosystem is greater than the sum of its parts. LiveKit too, we're LiveKit alumni, so we love those guys, and our WebRTC infra is built on them. And vLLM, a notable open source project.

And yeah, that's it. Thank you.