Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

Chapters
0:00 Introduction to Gabber and Real-Time AI
2:15 Gabber's Mission for Consumer AI
4:17 The Orpheus Voice Model
5:43 Challenges in Voice Cloning
7:44 Latency Management and "Head of Line Silence"
11:07 Infrastructure for Batch Inference
11:36 Leveraging vLLM and Dynamic Quantization
13:21 Load Balancing with a Consistent Hash Ring
14:17 System Architecture Overview
15:07 Conclusion and Open Source Shout-outs
00:00:15.860 |
That's Jack in the front there, but it's just me. 00:00:19.160 |
So yeah, we're really just gonna talk about our experience 00:00:22.800 |
hosting Orpheus inference for our real-time stack. 00:00:29.980 |
I'm the CTO at a company called Gabber, a small startup. 00:00:34.280 |
But I spent a lot of my career doing real-time media stuff, 00:00:37.240 |
so sending audio and video around the internet. 00:00:45.960 |
We did like a game streaming app, kind of like OBS, 00:00:48.940 |
built a lot of the streaming infrastructure there, 00:00:51.800 |
built an ML pipeline to watch people play video games 00:00:59.020 |
and detect when they got a kill or a victory or something. 00:01:01.600 |
So I spent a lot of time in like the GStreamer trenches. 00:01:08.360 |
Took a detour, worked at Uber for a couple years, 00:01:10.640 |
then left that, did a multiplayer gaming startup 00:01:17.320 |
So basically trying to bring like AAA-style multiplayer to web games, 00:01:22.040 |
with voice and stuff too. 00:01:23.620 |
So there's a lot of real-time media slash real-time simulation kind of stuff there. 00:01:28.940 |
And then yeah, we didn't do a super good job there and shut that company down. 00:01:35.320 |
And we were using LiveKit; I made a LiveKit SDK, and that segued into me working at LiveKit. 00:01:40.660 |
I think a lot of people probably heard of LiveKit in this room. 00:01:43.240 |
And yeah, the second half of my time at LiveKit was spent doing the LiveKit Agents platform. 00:01:49.780 |
So that's like the platform that was kind of born out of LiveKit's involvement with GPT voice. 00:01:54.860 |
So yeah, wrote the first line of code on that and worked on that. 00:01:58.300 |
And then yeah, I left LiveKit and did another startup with my brother, which is Gabber. 00:02:05.560 |
So Gabber is real-time infra for, basically, AI personas. 00:02:10.740 |
So we have some core building blocks like voice, memory, video inputs coming soon, tool calling, and so on. 00:02:17.780 |
But our focus is really on the consumer apps. 00:02:22.420 |
We see the replace-a-human use cases pretty often, like the call center and customer support use cases. 00:02:31.740 |
But our interest is really in the consumer space. 00:02:34.180 |
We think these kind of like real-time synchronous AI experiences are going to be as ubiquitous 00:02:39.980 |
as websites and apps in the next kind of like two to five years. 00:02:43.020 |
So that's our focus, and that's how we try and differentiate: the opinions we bake into our product. 00:02:54.940 |
The use cases we see are kind of like the usual suspects: AI Girlfriends was the first one. 00:02:59.120 |
That is like -- I'll get to why that's the first one, I guess. 00:03:02.560 |
But other ones are like AI NPCs, AI therapists, AI personal trainers, AI toys for kids. 00:03:09.520 |
I think that you saw that a couple of sessions ago. 00:03:11.560 |
These use cases, like we're seeing a lot of different use cases. 00:03:13.560 |
And I saw it at LiveKit, too, and it got me really, really excited about this stuff. 00:03:17.560 |
But AI Girlfriends was the first one mainly because everything is so expensive. 00:03:24.000 |
Some of these voice platforms, it's, you know, end-to-end upwards of $5 an hour and that doesn't 00:03:30.060 |
really work for like 90% of the consumer apps. 00:03:33.000 |
But AI Girlfriends works because the users are already paying -- it's usually like a credits model: 00:03:39.560 |
you buy credits, you use the app, and it uses up the credits. 00:03:42.000 |
So they're more comfortable with that kind of spend. 00:03:44.440 |
But most consumer use cases, they need something pretty close to free. 00:03:48.440 |
So we knew that -- and at the time, we were not hosting any voice models. 00:03:54.060 |
We knew that the only way to execute on our vision of putting these 00:03:58.400 |
experiences everywhere was to start bringing more things in-house and running on our own infrastructure. 00:04:04.700 |
So at the time, open-source, there weren't a lot of good open-source voice models. 00:04:09.580 |
There were a lot of good ones for asynchronous use cases, so generating voice slower than real-time. 00:04:15.200 |
But there weren't any really good like real-time streaming ones until Orpheus. 00:04:20.060 |
Orpheus was the first really good one that was kind of like ready to go. 00:04:24.920 |
So Orpheus came out and we're like, okay, this is our time to shine. 00:04:28.840 |
We immediately like put it on an H100, hosted it, went viral with Jack's tweets, and got a lot of traction. 00:04:37.680 |
And yeah, that was kind of like the starting point. 00:04:39.500 |
It's like our company -- there's before Orpheus and after Orpheus; our company kind of changed after that. 00:04:47.520 |
So Orpheus -- it's a voice model, but it started as a Llama 3B. 00:04:51.420 |
It was trained on -- pre-trained on like 100,000 hours of voice data and text data as well 00:04:58.480 |
to make sure it kept its understanding of kind of like language. 00:05:06.060 |
It outputs tokens from SNAC, which is another open-source project, an audio codec. 00:05:11.980 |
So it's trained to output the 24-kilohertz version of SNAC tokens. 00:05:16.740 |
Those SNAC tokens are then decoded, and then you get audio. 00:05:21.600 |
Important thing to note here is it's about 85 SNAC tokens for one second of audio. 00:05:25.880 |
So Orpheus, wherever you're hosting it, has to sustain a throughput of about 85 tokens per second. 00:05:33.180 |
I mean, you want like 90 to 100 tokens per second to keep up with real-time. 00:05:37.140 |
Otherwise, you get gaps, obviously, in the audio, and it sounds bad. 00:05:42.400 |
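To put numbers on that, here is a minimal sketch; the 85-tokens-per-second-of-audio figure is from the talk, and the generation rates are just illustrative:

```python
# Rough real-time check for Orpheus-style streaming TTS: ~85 SNAC tokens
# correspond to one second of 24 kHz audio, so the model has to generate
# tokens at least that fast or playback will stall.

TOKENS_PER_AUDIO_SECOND = 85  # from the talk

def real_time_factor(gen_tokens_per_sec: float) -> float:
    """>= 1.0 means generation keeps up with playback; < 1.0 means audible gaps."""
    return gen_tokens_per_sec / TOKENS_PER_AUDIO_SECOND

for rate in (80, 95, 105):  # illustrative generation rates
    rtf = real_time_factor(rate)
    print(f"{rate} tok/s -> RTF {rtf:.2f} ({'ok' if rtf >= 1.0 else 'will stall'})")
```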
The other thing that was important to us, because we're going after the consumer use cases, was voice cloning. 00:05:45.960 |
So our clones need to be emotive and high-fidelity, and one-shot cloning doesn't work that well. 00:05:55.120 |
That's more true for Orpheus because it only had 100,000 hours of pre-training data. 00:06:00.120 |
Whereas I think some of the zero-shot emergent behavior comes at like a million-plus hours. 00:06:05.200 |
So we're scrappy, I think you can tell by like our design here. 00:06:14.360 |
So we went with low-rank fine-tunes for our clones. 00:06:22.640 |
This isn't like the best example -- the better ones are customers', so I didn't want to put those in here. 00:06:26.520 |
I cloned Jack's voice yesterday and used rank 16, alpha 32, on basically all the projections. 00:06:34.840 |
So that's the source, and then here's the result of a fine-tune. 00:06:53.200 |
This was like pretty bad data, like 10 minutes of data. 00:07:03.740 |
It's pretty overfit, but you'll see it like still sounds okay. 00:07:17.600 |
So it's not bad, but you know, I'm the older brother, so I've been hearing his voice most of my life. 00:07:25.920 |
So it's jarring to me, but the cool thing is, yeah, it's trained to do these tokens, 00:07:30.840 |
which is important for consumer, and it's pretty emotive. 00:07:35.060 |
Like when it said "I'm kind of sick," it sounded pretty sad, so it picks up on the language cues. 00:07:43.580 |
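For reference, the rank-16, alpha-32, all-projections clone fine-tune mentioned above would look roughly like this with Hugging Face PEFT; the library choice, model ID, and module names are my assumptions, since the talk doesn't say what tooling Gabber uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint id; Orpheus is a Llama-3B-derived model per the talk.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-base")

lora_cfg = LoraConfig(
    r=16,                 # "rank 16"
    lora_alpha=32,        # "alpha 32"
    target_modules=[      # "basically all the projections" (Llama naming)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```

At roughly these hyperparameters the adapter weights come out in the 100-200 MB range, which is the size the load-balancing discussion later in the talk assumes.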
The other thing that's really important, obviously, for all voice use cases, not just consumer, is latency. 00:07:49.740 |
So there's four things that really affect latency. 00:08:01.420 |
Network latency is one of them. 00:08:04.180 |
But we found the biggest cause of latency was what we're calling head-of-line silence. 00:08:09.060 |
This is somewhat specific to the Orpheus model, so this isn't gonna be true for all models. 00:08:15.860 |
But head-of-line silence is basically that somewhere in the fine-tune of Orpheus, the data had a 00:08:22.560 |
lot of silence at the beginning, because it was voice actors that they hired, 00:08:27.080 |
and they took those recordings and fine-tuned a model from them. 00:08:31.240 |
So this is like the default Orpheus voice, or one of the ones that came with it, called Tara. 00:08:35.060 |
And it has about 600 milliseconds of silence at the beginning. 00:08:38.640 |
And they probably had good reasons for adding silence at the beginning. 00:08:48.060 |
We actually measured it: 600 milliseconds of silence. 00:08:58.620 |
So that's over half a second of silence. 00:09:02.620 |
So we are filtering out the silence, like we're not just playing that audio back to the user, 00:09:06.200 |
but because it takes a while to generate those tokens, we're adding like basically half a second 00:09:10.860 |
of latency just on wasted compute, pretty much. 00:09:16.440 |
So yeah, even filtering out the silence, you're only saving like 10% there, because you're only generating slightly faster than real time. 00:09:21.780 |
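To make the wasted-compute point concrete: the 600 ms and ~85 tokens/sec figures are from the talk, and the 95 tokens/sec generation rate is the number quoted later for LoRA voices.

```python
# How much wall-clock latency leading silence costs, even if you never play it:
# the tokens for the silent audio still have to be generated before the first
# audible token arrives.

TOKENS_PER_AUDIO_SECOND = 85   # ~85 SNAC tokens per second of audio
LEADING_SILENCE_SEC = 0.6      # head-of-line silence in the stock Tara voice
GEN_TOKENS_PER_SEC = 95        # generation rate quoted later for LoRA voices

silent_tokens = LEADING_SILENCE_SEC * TOKENS_PER_AUDIO_SECOND
wasted_wall_clock = silent_tokens / GEN_TOKENS_PER_SEC

print(f"{silent_tokens:.0f} silent tokens -> {wasted_wall_clock * 1000:.0f} ms of added latency")
# ~51 tokens -> roughly 540 ms: you only "save" about 10% versus the 600 ms of audio.
```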
We're scrappy again, so we're running on L40s. 00:09:25.900 |
But what we found was interesting is that we could actually just fine-tune the silence away. 00:09:29.060 |
So this is an example of a clone that we did, a LoRA fine-tune of a customer's clone. 00:09:34.700 |
And the latency is basically like 100 milliseconds, like P50. 00:09:39.240 |
So much better, like half a second basically for free. 00:09:43.560 |
And that matters because you kind of have a latency budget in these real-time interactions. 00:09:49.200 |
So the way these work is the human talks, and then at some point you decide: is the human done talking? 00:09:55.440 |
Those models are not perfect, so you typically add like a snooze period at the end of that. 00:09:59.940 |
But during that snooze period, you can still do work. 00:10:06.060 |
The way we have our Orpheus stack set up is we start generating audio after two sentences, 00:10:11.540 |
or when the LLM response is done, but two sentences typically, which gives it enough context to capture the emotion. 00:10:17.700 |
So all that to say is if we generate the first audio packet within that snooze period, 00:10:21.860 |
then we're kind of like in the money on latency, in our latency budget. 00:10:25.360 |
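A rough sketch of that two-sentence buffering, not Gabber's actual code: the sentence-splitting regex and the `start_tts` callback are stand-ins.

```python
import re
from typing import Callable, Iterable

SENTENCE_END = re.compile(r"[.!?]\s")  # crude sentence boundary detector

def stream_llm_to_tts(llm_chunks: Iterable[str],
                      start_tts: Callable[[str], None],
                      min_sentences: int = 2) -> None:
    """Buffer streamed LLM text and hand it to TTS once `min_sentences`
    complete sentences have arrived (or the stream ends), so the first
    audio packet can land inside the end-pointing 'snooze' window."""
    buffer, started = "", False
    for chunk in llm_chunks:
        buffer += chunk
        if not started and len(SENTENCE_END.findall(buffer)) >= min_sentences:
            start_tts(buffer)   # first TTS call, with enough context for tone
            buffer, started = "", True
        elif started and buffer:
            start_tts(buffer)   # after that, forward text as it streams in
            buffer = ""
    if buffer:                  # LLM finished before two full sentences
        start_tts(buffer)

# Example: stream_llm_to_tts(["Hey! ", "Good to see you. ", "How was your day?"], print)
```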
Now these end-pointing models are going to get better, so you know that snooze period's 00:10:29.020 |
going to go down; like half a second to a second is probably the sweet spot. 00:10:33.020 |
But one and a half seconds is kind of the threshold, I think: anything above that sounds pretty bad, 00:10:38.600 |
and anything kind of equal to or below that is acceptable. 00:10:42.860 |
So yeah, that half a second mattered a lot because it gives our LLM more time to generate its response. 00:10:49.220 |
And because we're letting customers bring their own LLMs, it's somewhat out of our control. 00:10:55.900 |
So the next big category here is infrastructure. 00:10:58.500 |
Again, we're scrappy, so we really needed something that was robust and not too complicated. 00:11:08.460 |
So we needed batch inference obviously to save money. 00:11:10.500 |
So we needed to run multiple generations in the same batch, on the same GPU, concurrently. 00:11:18.460 |
And we also needed multiple LoRAs to be running in the same batch on the same GPU. 00:11:23.920 |
And we wanted one load balancer in front of everything. 00:11:25.900 |
We're spinning up multiple different models for different languages. 00:11:28.600 |
So we wanted all of this to sort of be like a black box that just worked. 00:11:34.640 |
So vLLM to the rescue: it supports all those things. 00:11:38.140 |
vLLM can do batch inference with LoRAs, which is really, really awesome. 00:11:46.100 |
Unfortunately, the FP16 model was slower than real time on L40s. 00:11:49.100 |
It worked on H100s, but on the L40s it was slower than real time. 00:11:55.100 |
They support FP8 dynamic quantization, which requires basically zero work. 00:12:02.060 |
It does all the scaling automatically, so you don't have to run calibration data through it or anything. 00:12:12.060 |
So that brought us up to 105 tokens a second on the non-fine-tuned voices and 95 tokens a second 00:12:19.060 |
on the LoRA voices with a batch size of 10, which was, yeah, well in the money in terms of margins. 00:12:30.020 |
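A minimal sketch of that combination using vLLM's offline Python API -- FP8 dynamic quantization plus multi-LoRA batching. The model ID, LoRA path, and batching limits are placeholders, not Gabber's actual config:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# FP8 dynamic quantization + multi-LoRA batching on one GPU.
llm = LLM(
    model="canopylabs/orpheus-3b",   # placeholder model id
    quantization="fp8",              # online/dynamic FP8, no calibration pass
    enable_lora=True,                # serve many voice-clone adapters together
    max_loras=8,                     # adapters resident per batch (illustrative)
    max_lora_rank=16,                # matches the rank-16 clone fine-tunes
)

params = SamplingParams(temperature=0.6, max_tokens=1024)

# Each request can point at a different clone's adapter; vLLM batches them together.
out = llm.generate(
    ["<voice prompt / text to speak>"],
    params,
    lora_request=LoRARequest("jack-clone", 1, "/loras/jack-clone"),  # hypothetical path
)
```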
Part of the infrastructure is, of course, load balancing. 00:12:33.020 |
So LoRAs, depending on what your hyperparameters are, are between 100 and 200 megabytes. 00:12:40.020 |
So you want to make sure you end up on a server that already has that LoRA in memory, things like that. 00:12:45.980 |
So that's where sticky sessions come in. 00:12:53.980 |
We also wanted to support streaming inputs, mainly because the LLM often, you know, might 00:12:59.380 |
not be done by the time you want to start producing audio, and we also wanted to support arbitrarily 00:13:03.780 |
long generation, so like storytelling, things like that. 00:13:08.020 |
So that's another reason why this load balancing problem is interesting, 00:13:14.180 |
because you want to make sure you end up on the same GPU across the whole session. 00:13:19.380 |
So we went with pretty much a by-the-book consistent hash ring setup. 00:13:24.620 |
So if you've seen hash rings before, this is not that interesting, but basically the way 00:13:28.180 |
it works is you hash the servers multiple times, into what are called virtual nodes, so they 00:13:33.540 |
distribute around the hash ring, and then when a LoRA generation starts, you hash that LoRA's ID 00:13:38.500 |
with the same hashing algorithm, pick the nearest server to it, and it just works. 00:13:44.220 |
And the reason this was chosen is because you can remove a server and it doesn't rebalance 00:13:49.220 |
everything; only a few migrations are needed. 00:13:55.980 |
The other nice thing about this strategy is that if a clone gets very popular, it's pretty easy to scale it out. 00:14:02.420 |
The more popular a LoRA is, you can just 00:14:07.580 |
add it to more servers, and upscale and downscale that very elegantly without really a ton of work. 00:14:14.540 |
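Here's what a by-the-book ring with virtual nodes might look like; this is a sketch, not Gabber's implementation, and replicating a popular clone is modeled by returning the first N distinct servers clockwise from the key.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes
        self._ring = []              # sorted list of (hash, server)
        for s in servers:
            self.add_server(s)

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes):                      # virtual nodes
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key: str, replicas: int = 1) -> list:
        """Walk clockwise from the key's hash, returning the first
        `replicas` distinct servers (use replicas > 1 for popular clones)."""
        if not self._ring:
            return []
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        found = []
        for offset in range(len(self._ring)):
            server = self._ring[(idx + offset) % len(self._ring)][1]
            if server not in found:
                found.append(server)
                if len(found) == replicas:
                    break
        return found

ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
print(ring.lookup("lora:jack-clone"))                 # sticky: same GPU every time
print(ring.lookup("lora:popular-clone", replicas=2))  # fan a hot clone out to 2 GPUs
```

Removing a server only remaps the keys whose nearest virtual node belonged to it; everything else stays put, which is the property called out above.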
So yeah, at the high level it looks something like this. 00:14:19.540 |
We have our WebRTC backend that kind of like terminates the client connections, then we use 00:14:23.700 |
WebSockets to our GPUs, and then the GPUs are talking to Redis. Redis is not the best choice, 00:14:32.460 |
but if we scale beyond needing Redis for this kind of thing, we can just solve that with piles of money. 00:14:40.100 |
But yeah, the way it works here is you start a session, the WebRTC backend just connects 00:14:44.940 |
to any GPU, then it asks Redis, "Hey, what GPU is this request supposed to be on?" and 00:14:51.720 |
then it just proxies it with another TCP connection to the correct GPU, which is fine because these 00:14:56.220 |
GPUs are in the same data center on private networking, so it's low latency; TCP is totally fine within the data center. 00:15:06.700 |
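The routing step described here could look something like this. It's a sketch under my own assumptions -- redis-py as the client, a made-up key scheme and host name, and the hash-ring choice passed in from elsewhere -- since the talk doesn't detail how the mapping is stored.

```python
import redis  # assumption: the standard redis-py client

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical internal address

def resolve_owner(session_id: str, ring_choice: str) -> str:
    """Called by whichever GPU the WebRTC backend happened to connect to.
    `ring_choice` is the GPU the consistent hash ring picked for this
    session's LoRA; Redis just records it so the whole session stays sticky."""
    key = f"session:{session_id}:gpu"          # made-up key scheme
    r.set(key, ring_choice, nx=True, ex=3600)  # claim only if not already assigned
    return r.get(key).decode()

# On the receiving GPU:
#   owner = resolve_owner(session_id, ring_choice)
#   if owner != MY_GPU_ID:
#       open a plain TCP connection to `owner` and proxy the stream there
```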
I mean, the conclusion here is we're pretty scrappy and we were able to host voice models 00:15:13.000 |
on GPUs and handle that infrastructure, so you can too. 00:15:16.860 |
Open source is there and yeah, I think it's gonna unlock a ton of cool use cases. 00:15:24.100 |
Shout out to swyx, who's a supporter of ours and obviously put this on, or half of it, I guess. 00:15:33.800 |
Canopy Labs, who created Orpheus -- haven't met them, would love to if they're here. 00:15:39.760 |
And then just free open-source software in general: Canopy Labs' Orpheus is built on Llama and SNAC, so 00:15:45.280 |
this whole ecosystem is greater than the sum of its parts, I guess. And LiveKit -- we're LiveKit 00:15:51.120 |
alums, so love those guys, and our WebRTC infra is built on them. And then vLLM, a notable open-source project as well.