Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

Chapters
0:00 Introduction to Gabber and Real-Time AI
2:15 Gabber's Mission for Consumer AI
4:17 The Orpheus Voice Model
5:43 Challenges in Voice Cloning
7:44 Latency Management and "Head of Line Silence"
11:07 Infrastructure for Batch Inference
11:36 Leveraging vLLM and Dynamic Quantization
13:21 Load Balancing with a Consistent Hash Ring
14:17 System Architecture Overview
15:07 Conclusion and Open Source Shout-outs
00:00:15.860 |
That's Jack in the front there, but it's just me. 00:00:19.160 |
So yeah, we're really just gonna talk about our experience 00:00:22.800 |
hosting Orpheus inference for our real-time stack. 00:00:29.980 |
I'm the CTO at a company called Gabber, a small startup. 00:00:34.280 |
But I spent a lot of my career doing real-time media stuff, 00:00:37.240 |
so sending audio and video around the internet. 00:00:45.960 |
We did like a game streaming app, kind of like OBS, 00:00:48.940 |
built a lot of the streaming infrastructure there, 00:00:51.800 |
built an ML pipeline to watch people play video games 00:00:59.020 |
and detect when they got a kill or a victory or something. 00:01:01.600 |
So I spent a lot of time in like the GStreamer trenches. 00:01:08.360 |
Took a detour, worked at Uber for a couple years, 00:01:10.640 |
then left that, did a multiplayer gaming startup 00:01:17.320 |
So basically trying to bring like AAA-style multiplayer to web games, 00:01:22.040 |
with voice and stuff too. 00:01:23.620 |
So there's a lot of real-time media slash real-time simulation kind of stuff there. 00:01:28.940 |
And then yeah, we didn't do a super good job there and shut that company down. 00:01:35.320 |
And we were using LiveKit; I made a LiveKit SDK, and that segued into me working at LiveKit. 00:01:40.660 |
I think a lot of people probably heard of LiveKit in this room. 00:01:43.240 |
And yeah, the second half of my time at LiveKit was spent doing the LiveKit Agents platform. 00:01:49.780 |
So that's like the platform that was kind of born out of LiveKit's involvement with GPT voice. 00:01:54.860 |
So yeah, wrote the first line of code on that and worked on that. 00:01:58.300 |
And then yeah, I left LiveKit and did another startup with my brother, which is Gabber. 00:02:05.560 |
So Gabber is real-time infra for, basically, AI personas. 00:02:10.740 |
So we have some core building blocks like voice, memory, video inputs coming soon, tool calling, and so on. 00:02:17.780 |
But our focus is really on the consumer apps. 00:02:22.420 |
We see the replace-a-human use cases pretty often, like the call center and customer support use cases. 00:02:31.740 |
But our interest is really in the consumer space. 00:02:34.180 |
We think these kind of like real-time synchronous AI experiences are going to be as ubiquitous 00:02:39.980 |
as websites and apps in the next kind of like two to five years. 00:02:43.020 |
So that's our focus, and that's how we try and differentiate: the opinions we bake into our product. 00:02:54.940 |
The use cases we see are kind of like the usual suspects: AI Girlfriends was the first one. 00:02:59.120 |
That is like -- I'll get to why that's the first one, I guess. 00:03:02.560 |
But other ones are like AI NPCs, AI therapists, AI personal trainers, AI toys for kids. 00:03:09.520 |
I think that you saw that a couple of sessions ago. 00:03:11.560 |
These use cases, like we're seeing a lot of different use cases. 00:03:13.560 |
And I saw it at LiveKit, too, and it got me really, really excited about this stuff. 00:03:17.560 |
But AI Girlfriends was the first one mainly because everything is so expensive. 00:03:24.000 |
Some of these voice platforms, it's, you know, end-to-end upwards of $5 an hour and that doesn't 00:03:30.060 |
really work for like 90% of the consumer apps. 00:03:33.000 |
But AI Girlfriends works because the users are already paying -- it's usually like a credits model: 00:03:39.560 |
you buy credits, you use the app, and it uses up the credits. 00:03:42.000 |
So they're more comfortable with that kind of spend. 00:03:44.440 |
But most consumer use cases, they need something pretty close to free. 00:03:48.440 |
So we knew that -- and at the time, we were not hosting any voice models. 00:03:54.060 |
We knew that the only way to execute on our vision of putting these 00:03:58.400 |
experiences everywhere was to start bringing more things in-house and running on our own infrastructure. 00:04:04.700 |
So at the time, open-source, there weren't a lot of good open-source voice models. 00:04:09.580 |
There were a lot of good ones for asynchronous use cases, so generating voice slower than real-time. 00:04:15.200 |
But there weren't any really good like real-time streaming ones until Orpheus. 00:04:20.060 |
Orpheus was the first really good one that was kind of like ready to go. 00:04:24.920 |
So Orpheus came out and we're like, okay, this is our time to shine. 00:04:28.840 |
We immediately like put it on an H100, hosted it, went viral with Jack's tweets, and got a lot of traction. 00:04:37.680 |
And yeah, that was kind of like the starting point. 00:04:39.500 |
It's like our company -- there's before Orpheus and after Orpheus; our company kind of changed after that. 00:04:47.520 |
So Orpheus -- it's a voice model, but it started as a Llama 3B. 00:04:51.420 |
It was trained on -- pre-trained on like 100,000 hours of voice data and text data as well 00:04:58.480 |
to make sure it kept its understanding of kind of like language. 00:05:06.060 |
It outputs tokens from SNAC, which is another open-source project, an audio codec. 00:05:11.980 |
So it's trained to output the 24-kilohertz version of SNAC tokens. 00:05:16.740 |
Those SNAC tokens are then decoded, and then you get audio. 00:05:21.600 |
Important thing to note here is it's about 85 SNAC tokens for one second of audio. 00:05:25.880 |
So Orpheus, wherever you're hosting it, has to sustain a throughput of about 85 tokens per second. 00:05:33.180 |
I mean, you want like 90 to 100 tokens per second to keep up with real-time. 00:05:37.140 |
Otherwise, you get gaps, obviously, in the audio, and it sounds bad. 00:05:42.400 |
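To put numbers on that, here is a minimal sketch; the 85-tokens-per-second-of-audio figure is from the talk, and the generation rates are just illustrative:

```python
# Rough real-time check for Orpheus-style streaming TTS: ~85 SNAC tokens
# correspond to one second of 24 kHz audio, so the model has to generate
# tokens at least that fast or playback will stall.

TOKENS_PER_AUDIO_SECOND = 85  # from the talk

def real_time_factor(gen_tokens_per_sec: float) -> float:
    """>= 1.0 means generation keeps up with playback; < 1.0 means audible gaps."""
    return gen_tokens_per_sec / TOKENS_PER_AUDIO_SECOND

for rate in (80, 95, 105):  # illustrative generation rates
    rtf = real_time_factor(rate)
    print(f"{rate} tok/s -> RTF {rtf:.2f} ({'ok' if rtf >= 1.0 else 'will stall'})")
```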
The other thing that was important to us, because we're going after the consumer use cases, was voice cloning. 00:05:45.960 |
So our clones need to be emotive and high-fidelity, and one-shot cloning doesn't work that well. 00:05:55.120 |
That's more true for Orpheus because it only had 100,000 hours of pre-training data. 00:06:00.120 |
Whereas I think some of the zero-shot emergent behavior comes at like a million-plus hours. 00:06:05.200 |
So we're scrappy, I think you can tell by like our design here. 00:06:14.360 |
So we went with low-rank fine-tunes for our clones. 00:06:22.640 |
This isn't like the best example -- the better ones are customers', so I didn't want to put those in here. 00:06:26.520 |
I cloned Jack's voice yesterday and used rank 16, alpha 32, on basically all the projections. 00:06:34.840 |
So that's the source, and then here's the result of a fine-tune. 00:06:53.200 |
This was like pretty bad data, like 10 minutes of data. 00:07:03.740 |
It's pretty overfit, but you'll see it like still sounds okay. 00:07:17.600 |
So it's not bad, but you know, I'm the older brother, so I've been hearing his voice most of my life. 00:07:25.920 |
So it's jarring to me, but the cool thing is, yeah, it's trained to do these tokens, 00:07:30.840 |
which is important for consumer, and it's pretty emotive. 00:07:35.060 |
Like when it said "I'm kind of sick," it sounded pretty sad, so it picks up on the language cues. 00:07:43.580 |
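For reference, the rank-16, alpha-32, all-projections clone fine-tune mentioned above would look roughly like this with Hugging Face PEFT; the library choice, model ID, and module names are my assumptions, since the talk doesn't say what tooling Gabber uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint id; Orpheus is a Llama-3B-derived model per the talk.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-base")

lora_cfg = LoraConfig(
    r=16,                 # "rank 16"
    lora_alpha=32,        # "alpha 32"
    target_modules=[      # "basically all the projections" (Llama naming)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```

At roughly these hyperparameters the adapter weights come out in the 100-200 MB range, which is the size the load-balancing discussion later in the talk assumes.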
The other thing that's really important, obviously, for all voice use cases, not just consumer, is latency. 00:07:49.740 |
So there's four things that really affect latency. 00:08:01.420 |
Network latency is one of them. 00:08:04.180 |
But we found the biggest cause of latency was what we're calling head-of-line silence. 00:08:09.060 |
This is somewhat specific to the Orpheus model, so this isn't gonna be true for all models. 00:08:15.860 |
But head-of-line silence is basically that somewhere in the fine-tune of Orpheus, the data had a 00:08:22.560 |
lot of silence at the beginning, because it was voice actors that they hired, 00:08:27.080 |
and they took those recordings and fine-tuned a model from them. 00:08:31.240 |
So this is like the default Orpheus voice, or one of the ones that came with it, called Tara. 00:08:35.060 |
And it has about 600 milliseconds of silence at the beginning. 00:08:38.640 |
And they probably had good reasons for adding silence at the beginning. 00:08:48.060 |
We actually measured it: 600 milliseconds of silence. 00:08:58.620 |
So that's over half a second of silence. 00:09:02.620 |
So we are filtering out the silence, like we're not just playing that audio back to the user, 00:09:06.200 |
but because it takes a while to generate those tokens, we're adding like basically half a second 00:09:10.860 |
of latency just on wasted compute, pretty much. 00:09:16.440 |
So yeah, even filtering out the silence, you're only saving like 10% there, because you're only generating slightly faster than real time. 00:09:21.780 |
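To make the wasted-compute point concrete: the 600 ms and ~85 tokens/sec figures are from the talk, and the 95 tokens/sec generation rate is the number quoted later for LoRA voices.

```python
# How much wall-clock latency leading silence costs, even if you never play it:
# the tokens for the silent audio still have to be generated before the first
# audible token arrives.

TOKENS_PER_AUDIO_SECOND = 85   # ~85 SNAC tokens per second of audio
LEADING_SILENCE_SEC = 0.6      # head-of-line silence in the stock Tara voice
GEN_TOKENS_PER_SEC = 95        # generation rate quoted later for LoRA voices

silent_tokens = LEADING_SILENCE_SEC * TOKENS_PER_AUDIO_SECOND
wasted_wall_clock = silent_tokens / GEN_TOKENS_PER_SEC

print(f"{silent_tokens:.0f} silent tokens -> {wasted_wall_clock * 1000:.0f} ms of added latency")
# ~51 tokens -> roughly 540 ms: you only "save" about 10% versus the 600 ms of audio.
```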
We're scrappy again, so we're running on L40s. 00:09:25.900 |
But what we found was interesting is that we could actually just fine-tune the silence away. 00:09:29.060 |
So this is an example of a clone that we did, a LoRA fine-tune of a customer's clone. 00:09:34.700 |
And the latency is basically like 100 milliseconds, like P50. 00:09:39.240 |
So much better, like half a second basically for free. 00:09:43.560 |
And that matters because you kind of have a latency budget in these real-time interactions. 00:09:49.200 |
So the way these work is the human talks, and then at some point you decide: is the human done talking? 00:09:55.440 |
Those models are not perfect, so you typically add like a snooze period at the end of that. 00:09:59.940 |
But during that snooze period, you can still do work. 00:10:06.060 |
The way we have our Orpheus stack set up is we start generating audio after two sentences, 00:10:11.540 |
or when the LLM response is done, but two sentences typically, which gives it enough context to capture the emotion. 00:10:17.700 |
So all that to say is if we generate the first audio packet within that snooze period, 00:10:21.860 |
then we're kind of like in the money on latency, in our latency budget. 00:10:25.360 |
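A rough sketch of that two-sentence buffering, not Gabber's actual code: the sentence-splitting regex and the `start_tts` callback are stand-ins.

```python
import re
from typing import Callable, Iterable

SENTENCE_END = re.compile(r"[.!?]\s")  # crude sentence boundary detector

def stream_llm_to_tts(llm_chunks: Iterable[str],
                      start_tts: Callable[[str], None],
                      min_sentences: int = 2) -> None:
    """Buffer streamed LLM text and hand it to TTS once `min_sentences`
    complete sentences have arrived (or the stream ends), so the first
    audio packet can land inside the end-pointing 'snooze' window."""
    buffer, started = "", False
    for chunk in llm_chunks:
        buffer += chunk
        if not started and len(SENTENCE_END.findall(buffer)) >= min_sentences:
            start_tts(buffer)   # first TTS call, with enough context for tone
            buffer, started = "", True
        elif started and buffer:
            start_tts(buffer)   # after that, forward text as it streams in
            buffer = ""
    if buffer:                  # LLM finished before two full sentences
        start_tts(buffer)

# Example: stream_llm_to_tts(["Hey! ", "Good to see you. ", "How was your day?"], print)
```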
Now these end-pointing models are going to get better, so you know that snooze period's 00:10:29.020 |
going to go down; like half a second to a second is probably the sweet spot. 00:10:33.020 |
But one and a half seconds is kind of the threshold, I think: anything above that sounds pretty bad, 00:10:38.600 |
and anything kind of equal to or below that is acceptable. 00:10:42.860 |
So yeah, that half a second mattered a lot because it gives our LLM more time to generate its response. 00:10:49.220 |
And because we're letting customers bring their own LLMs, it's somewhat out of our control. 00:10:55.900 |
So the next big category here is infrastructure. 00:10:58.500 |
Again, we're scrappy, so we really needed something that was robust and not too complicated. 00:11:08.460 |
So we needed batch inference obviously to save money. 00:11:10.500 |
So we needed to run multiple generations in the same batch, on the same GPU, concurrently. 00:11:18.460 |
And we also needed multiple LoRAs to be running in the same batch on the same GPU. 00:11:23.920 |
And we wanted one load balancer in front of everything. 00:11:25.900 |
We're spinning up multiple different models for different languages. 00:11:28.600 |
So we wanted all of this to sort of be like a black box that just worked. 00:11:34.640 |
So vLLM to the rescue: it supports all those things. 00:11:38.140 |
vLLM can do batch inference with LoRAs, which is really, really awesome. 00:11:46.100 |
Unfortunately, the FP16 model was slower than real time on L40s. 00:11:49.100 |
It worked on H100s, but on the L40s it was slower than real time. 00:11:55.100 |
They support FP8 dynamic quantization, which requires basically zero work. 00:12:02.060 |
It does all the scaling automatically, so you don't have to run calibration data through it or anything. 00:12:12.060 |
So that brought us up to 105 tokens a second on the non-fine-tuned voices and 95 tokens a second 00:12:19.060 |
on the LoRA voices with a batch size of 10, which was, yeah, well in the money in terms of margins. 00:12:30.020 |
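A minimal sketch of that combination using vLLM's offline Python API -- FP8 dynamic quantization plus multi-LoRA batching. The model ID, LoRA path, and batching limits are placeholders, not Gabber's actual config:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# FP8 dynamic quantization + multi-LoRA batching on one GPU.
llm = LLM(
    model="canopylabs/orpheus-3b",   # placeholder model id
    quantization="fp8",              # online/dynamic FP8, no calibration pass
    enable_lora=True,                # serve many voice-clone adapters together
    max_loras=8,                     # adapters resident per batch (illustrative)
    max_lora_rank=16,                # matches the rank-16 clone fine-tunes
)

params = SamplingParams(temperature=0.6, max_tokens=1024)

# Each request can point at a different clone's adapter; vLLM batches them together.
out = llm.generate(
    ["<voice prompt / text to speak>"],
    params,
    lora_request=LoRARequest("jack-clone", 1, "/loras/jack-clone"),  # hypothetical path
)
```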
Part of the infrastructure is, of course, load balancing. 00:12:33.020 |
So LoRAs, depending on what your hyperparameters are, are between 100 and 200 megabytes. 00:12:40.020 |
So you want to make sure you end up on a server that already has that LoRA in memory, things like that. 00:12:45.980 |
So that's where sticky sessions come in. 00:12:53.980 |
We also wanted to support streaming inputs, mainly because the LLM often, you know, might 00:12:59.380 |
not be done by the time you want to start producing audio, and we also wanted to support arbitrarily 00:13:03.780 |
long generation, so like storytelling, things like that. 00:13:08.020 |
So that's another reason why this load balancing problem is interesting, 00:13:14.180 |
because you want to make sure you end up on the same GPU across the whole session. 00:13:19.380 |
So we went with pretty much a by-the-book consistent hash ring setup. 00:13:24.620 |
So if you've seen hash rings before, this is not that interesting, but basically the way 00:13:28.180 |
it works is you hash the servers multiple times, into what are called virtual nodes, so they 00:13:33.540 |
distribute around the hash ring, and then when a LoRA generation starts, you hash that LoRA's ID 00:13:38.500 |
with the same hashing algorithm, pick the nearest server to it, and it just works. 00:13:44.220 |
And the reason this was chosen is because you can remove a server and it doesn't rebalance 00:13:49.220 |
everything; only a few migrations are needed. 00:13:55.980 |
The other nice thing about this strategy is that if a clone gets very popular, it's pretty easy to scale it out. 00:14:02.420 |
The more popular a LoRA is, you can just 00:14:07.580 |
add it to more servers, and upscale and downscale that very elegantly without really a ton of work. 00:14:14.540 |
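Here's what a by-the-book ring with virtual nodes might look like; this is a sketch, not Gabber's implementation, and replicating a popular clone is modeled by returning the first N distinct servers clockwise from the key.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes
        self._ring = []              # sorted list of (hash, server)
        for s in servers:
            self.add_server(s)

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes):                      # virtual nodes
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key: str, replicas: int = 1) -> list:
        """Walk clockwise from the key's hash, returning the first
        `replicas` distinct servers (use replicas > 1 for popular clones)."""
        if not self._ring:
            return []
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        found = []
        for offset in range(len(self._ring)):
            server = self._ring[(idx + offset) % len(self._ring)][1]
            if server not in found:
                found.append(server)
                if len(found) == replicas:
                    break
        return found

ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
print(ring.lookup("lora:jack-clone"))                 # sticky: same GPU every time
print(ring.lookup("lora:popular-clone", replicas=2))  # fan a hot clone out to 2 GPUs
```

Removing a server only remaps the keys whose nearest virtual node belonged to it; everything else stays put, which is the property called out above.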
So yeah, at the high level it looks something like this. 00:14:19.540 |
We have our WebRTC backend that kind of like terminates the client connections, then we use 00:14:23.700 |
WebSockets to our GPUs, and then the GPUs are talking to Redis. Redis is not the best choice, 00:14:32.460 |
but if we scale beyond needing Redis for this kind of thing, we can just solve that with piles of money. 00:14:40.100 |
But yeah, the way it works here is you start a session, the WebRTC backend just connects 00:14:44.940 |
to any GPU, then it asks Redis, "Hey, what GPU is this request supposed to be on?" and 00:14:51.720 |
then it just proxies it with another TCP connection to the correct GPU, which is fine because these 00:14:56.220 |
GPUs are in the same data center on private networking, so it's low latency; TCP is totally fine within the data center. 00:15:06.700 |
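The routing step described here could look something like this. It's a sketch under my own assumptions -- redis-py as the client, a made-up key scheme and host name, and the hash-ring choice passed in from elsewhere -- since the talk doesn't detail how the mapping is stored.

```python
import redis  # assumption: the standard redis-py client

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical internal address

def resolve_owner(session_id: str, ring_choice: str) -> str:
    """Called by whichever GPU the WebRTC backend happened to connect to.
    `ring_choice` is the GPU the consistent hash ring picked for this
    session's LoRA; Redis just records it so the whole session stays sticky."""
    key = f"session:{session_id}:gpu"          # made-up key scheme
    r.set(key, ring_choice, nx=True, ex=3600)  # claim only if not already assigned
    return r.get(key).decode()

# On the receiving GPU:
#   owner = resolve_owner(session_id, ring_choice)
#   if owner != MY_GPU_ID:
#       open a plain TCP connection to `owner` and proxy the stream there
```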
I mean, the conclusion here is we're pretty scrappy and we were able to host voice models 00:15:13.000 |
on GPUs and handle that infrastructure, so you can too. 00:15:16.860 |
Open source is there and yeah, I think it's gonna unlock a ton of cool use cases. 00:15:24.100 |
Shout out to swyx, who's a supporter of ours and obviously put this on, or half of it, I guess. 00:15:33.800 |
Canopy Labs, who created Orpheus -- haven't met them, would love to if they're here. 00:15:39.760 |
And then just free open-source software in general: Canopy Labs' Orpheus is built on Llama and SNAC, so 00:15:45.280 |
this whole ecosystem is greater than the sum of its parts, I guess. And LiveKit -- we're LiveKit 00:15:51.120 |
alums, so love those guys, and our WebRTC infra is built on them. And then vLLM, a notable open-source project as well.