My name is Rohit Taluri. I'm a member of the Foundation Model Training and Inference team at AWS, and we have an amazing talk today: a super awesome founder here talking about voice AI. So, Arjun, why don't you go ahead? Awesome. Well, thanks for the intro, Rohit. I'm Arjun, one of the co-founders of Cartesia AI.
I'll give you a little bit of a spiel about what we do, which is build real-time multimodal intelligence that runs on any device. The topic of this conversation will be focused a little more on how we build voice AI for the enterprise. As many of you know, voice AI is definitely up and coming, and it's one of the places where having interactive models is really critical to getting a great experience out of your agents.
So, just a little bit of a recap. When we think about foundation models, we often think about really large models hosted in the cloud, doing things in batch mode: you submit a request and then you get something back.
You're not going to wait 30 seconds, but you might get it back in 500 or 600 milliseconds, and that's okay. For text, you're typically not reading at speeds of 200 tokens per second anyway, so it's fine if there's a little bit of a delay up front and high throughput afterwards.
When you think about interactive applications, video and voice especially, this is an incredibly visceral experience you're giving to your users. Speed is of the utmost importance, and quality is just table stakes. So, at Cartesia, we're really trying to change the paradigm of how we think about foundation models.
Not just models built in the cloud for batch operations, but models you bring to real time, that cover multiple modalities (we'll be talking more about voice today), and that run anywhere in the world, not just in the cloud but on any device.
Because we're talking about voice, I want to give a little prelude on why it's so important to think about speed. Imagine you're having a conversation with the person next to you, and you probably should do that after this talk. The main thing is that it would feel really awkward if you were trying to talk to someone and they responded a second later.
In voice, you don't have seconds to give your response back; you have milliseconds. And when you think about voice agents, where you might be calling customer support to triage a problem, you're going to get pretty annoyed if the agent isn't responding and giving you accurate results as soon as possible.
With voice, you have to deal with things like interruptions, and you have to think about globalization: accents, the background noise you might be calling from. And at the end of the day, the experience you actually want to deliver is incredibly subjective.
There are so many nuances and customizations you want around voice. So what we've done at Cartesia is think about building solutions for voice AI on the modeling side from first principles. We care about three things. The first is quality. The naturalness of the voice must be exquisite; it's table stakes for the kinds of experiences you want to be able to deliver.
The second is latency. You want to hear the first byte, the first sound of audio, on the other end of the line as soon as possible. That gives your end-to-end agent a lot more time to do things like reason more, and a little more slack in the end-to-end system.
And third, arguably most important, is controllability. The experience you project by having an agent customized for the things you want to do is a reflection of your brand. You want the agent to be able to talk about what your company is, or what you're trying to sell, the way you would.
Being able to customize the voice AI to do this is critical and paramount to delivering a great experience to your user. So what we've done is pioneer a new architecture called state space models. These are an alternative to transformers, and the main takeaway is that transformers typically scale quadratically: the longer your inputs get, the more your memory and runtime grow.
In other words, the longer the inputs, the slower your models get. With state space models, or SSMs, generation at inference time is O(1) per token. We maintain a fixed-size state that you generate from, and that means you get consistent, very low latency, something you can't achieve with traditional transformer architectures.
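To make the scaling argument concrete, here is a minimal sketch, not Cartesia's actual architecture, of why a recurrent state space step costs the same for every token while an attention step has to re-read the whole history. The dimensions and matrices below are arbitrary illustrations:

```python
import numpy as np

# Minimal sketch (not Cartesia's architecture): per-token cost of a
# diagonal state space recurrence vs. attention over a growing context.
d_state, d_model = 64, 16
A = np.random.uniform(0.9, 0.99, size=d_state)   # per-channel decay
B = np.random.randn(d_state, d_model) * 0.1      # input projection
C = np.random.randn(d_model, d_state) * 0.1      # output projection

def ssm_step(state, x):
    """O(1) per token: update a fixed-size state, no lookback over history."""
    state = A * state + B @ x
    return state, C @ state

def attention_step(history, x):
    """O(T) per token (O(T^2) per sequence): attend over all past tokens."""
    history.append(x)
    K = np.stack(history)                         # (T, d_model), grows with T
    scores = K @ x / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return history, weights @ K

state, history = np.zeros(d_state), []
for t in range(1000):
    x = np.random.randn(d_model)
    state, y_ssm = ssm_step(state, x)             # constant work per step
    history, y_attn = attention_step(history, x)  # work grows with t
```

The point is only the cost profile: the recurrent step touches a fixed-size state, while the attention step touches everything generated so far.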
State space models have typically been lower performing than transformers in this recurrent setting, but we've closed that gap: our models actually perform better not just from a latency perspective but from a quality perspective as well. I could give a longer spiel about why this matters, but I think we're more interested in having a conversation around voice AI.
So, Ro, why don't you kick us off? Definitely. Just so you all know, this was supposed to be a fireside chat, with some chairs and me asking some good questions. We're going to open it up to the audience in a little bit, but I do have a few questions for you, Arjun.
You talked a little bit about latency. We hear about voice AI challenges like quality, the speed of the models, the way they're hosted on the edge, et cetera. What challenges are your customers facing? Why are you building for this? Yeah.
I think the main thing is that when you do voice AI, the main model that we have is Sonic 2, which is focused on voice generation, and that's one part of the puzzle. If you're actually trying to build voice agents, you have to hook it up to the LLM and to your speech-to-text model.
And the biggest issue, honestly, is that there's not enough time. You need your LLM to have as much time as possible; those models typically aren't built for low-latency workflows, so you want to give them the most slack. Latency is of the utmost importance there.
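To put a rough shape on that "slack" argument, here is a back-of-the-envelope turn-latency budget for a cascaded voice agent; every number below is an illustrative assumption, not a benchmark of any particular product:

```python
# Illustrative turn-latency budget for a cascaded voice agent (STT -> LLM -> TTS).
# All numbers are assumptions for the sake of the sketch, not measured figures.
budget_ms = 800                  # rough ceiling before a pause starts to feel awkward

fixed_costs_ms = {
    "endpointing": 200,          # deciding the caller has finished speaking
    "speech_to_text": 150,       # streaming STT finalizing the transcript
    "tts_first_audio": 90,       # time to the first audio chunk from TTS
    "network_overhead": 60,      # round trips between the services
}

llm_slack_ms = budget_ms - sum(fixed_costs_ms.values())
print(f"Time left for the LLM to think: {llm_slack_ms} ms")
# Every millisecond shaved off STT or TTS goes straight into the LLM's budget.
```

The only takeaway is the shape of the budget: the faster the speech components are, the more of the fixed conversational budget the LLM gets.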
I can talk a little about what we've done to create the fastest text-to-speech model in the world. The second thing, I'd say, is controllability. One thing we've noticed is that people use our platform because you're able to get amazing quality around things like voice cloning, accents, and capturing background noise natively in the generations you want.
People find it a little bit uncanny valley when the agent they're talking to just sounds perfect. They like the little phone noises in the background, the little beep boops; that's what you expect to hear
when you're on a phone call. So those are the two main things I'd say we've done a really amazing job of nailing down. Yeah. And you mentioned something interesting there, a couple of the use cases: voice cloning, control of the voices, et cetera.
Can you talk a little bit about the customer use cases you're seeing today in voice AI? Yeah. Voice AI, honestly, has penetrated so many markets. We see healthcare and customer support, and a lot of it is actually going into real-time gaming: when you're dealing with non-player characters, you want them to be dynamic and interact with the players of the game.
Those are three of many, many markets that voice AI has started to grow in. And what's really exciting is that because it's growing so fast, it's great to have partners like AWS who are also investing in the space and saying, look, these are things we need to support naturally, things people want to use our platform for.
What about human narration? We hear a lot about voice AI taking over different industries and the use cases you mentioned. Do you see a place for human narrators in the future as well? Definitely, yeah. Creators are a huge part of the voice AI platform.
One thing I think we do a really good job of is that we actually have a voice marketplace for creators. Why is this so important? Our goal is not to replace voice actors; that's not what I want to do at all.
My main goal is to give them a platform so that their essence, who they are, their personality, can come through and be licensed by other people who want to use the platform.
So we've had a lot of voice actors onboard onto our platform, and it's a great way of amplifying them. And yeah, a lot of use cases are honestly focused on narration as well. Got it. So I have a few more questions.
Maybe we'll end with them. Does anybody in the crowd have questions? Anything about voice AI? Question. I fear this might be a dumb question, but I'm going to ask it. No, please. I've just started using Cartesia, as part of a PipeCat framework, and it's a game changer. Awesome.
Because there's an Amazon partnership here, I was just wondering: Claude doesn't seem to work so well because of the latency. Is that normal? Is it one of the models you'd usually pair with, and are there plans for... Can you repeat the question on your end, just so that...
Sure, sure. ...the lag thing? Yeah, yeah. So the question was: Cartesia is integrated with PipeCat, and PipeCat and Cartesia both work with AWS, but Claude still has pretty high latency when you're trying to do things end to end.
So this is a huge, huge problem, and it's why the slack we give you on the TTS side is something you can account for on the LLM side. I'll let Ro take a stab at that question too, but I know that Claude and other models have mechanisms where you can have a dedicated instance running so you get better latency numbers.
A lot of this is really up to the LLM providers to make these optimizations. One thing we're really excited about is that our goal is to make real-time AI pervasive, so if there are certain applications you're really after, we're excited about how we can enable you to do that.
I'm happy to chat after this talk. Yeah, of course. And I'll add a couple of things to that, too. What's really amazing about Cartesia and their background is the development of custom model architectures specifically for this use case, which is voice AI. We spoke a little bit about SSMs and linear versus quadratic scaling.
One thing from the AWS side, and what I think is really amazing about our design philosophy for our generative AI ecosystem, is that we want customers to have optionality on our platform. If you look at AWS's model gardens, SageMaker JumpStart and Amazon Bedrock, they host many different types of models for specific types of use cases.
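As a small aside on what that optionality looks like in practice, here is a sketch of browsing the Amazon Bedrock model catalog with boto3; the region and filter value are illustrative assumptions:

```python
import boto3

# Sketch: list foundation models available in the Amazon Bedrock catalog.
# Assumes AWS credentials with Bedrock access; the region is just an example.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Filter by output modality to see, for example, text-generating models.
response = bedrock.list_foundation_models(byOutputModality="TEXT")
for model in response["modelSummaries"]:
    print(model["providerName"], "-", model["modelId"])
```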
And we're on the lookout for the next foundation model provider, one like Cartesia, that we can bring into our ecosystem to unlock downstream industries that might be underserved by the existing foundation models today. That's part of our strategy and part of our philosophy. And I'll say it again: with Cartesia, we're really unlocking voice AI.
We need real-time AI, we need the ability to host on edge devices, and we're unlocking a lot of customers with their models. So hopefully that answers it a little bit. A little bit, yeah; we can chat a little more after. So I have some friends doing video AI research, and something they tell me is that in video AI, the prevailing theory is that you don't need so much scale of actual video data, but rather density of information. I'm curious whether there's something similar in voice data: do you believe that to achieve next-level models you need the next scale of data, or is it more about the quality and density of that data?
Yeah, that's a good question. So the question was along the lines of how voice AI data compares to some of the other multimodal datasets; in video, one of the prevailing theories is that you don't actually need a lot of data, you just need it to be very rich, and is it the same in voice AI?
I think the short answer is yes, but also no. When you think about scaling a lot of models in generative AI, the general philosophy is large-scale pre-training data, and then some kind of alignment or preference data that you typically fine-tune on.
That holds true for a lot of other modalities; it actually holds true for video as well. My background, I guess I didn't mention, is that I did my PhD in generative AI at Stanford and worked a lot on image and video models there.
But audio is actually quite interesting, because what people want from preference data is so diverse that you can't capture it purely with a one-stage fine-tuning step. So yes, you definitely need very rich data, and of course high-quality data, but you also need to pair it with information covering what many different kinds of people will want.
This is how traditional LLMs are trained, and I think a lot of the large open-source image and video generation models are also trained this way. But honestly, it really depends on what you're going after. Sorry, I don't know.
I know that's not a satisfying answer, but yeah. What's your take on speech-to-speech models? I know Amazon has one. It seems immature, I would say. Yeah. But is that the future, or what's happening? Yeah, that's a great question. I think there are places where speech-to-speech is valuable right now.
But I don't think it's at the point where we can actually use it for production or enterprise-grade use cases. And it's great that Amazon released a speech-to-speech model recently; I think there are a few others out there as well.
But I think it's still pretty clear that orchestrated solutions are where you get a lot of controllability around how you want the different pieces of your system to operate. From a latency perspective, I think speech-to-speech models will, of course, dominate over time.
But that level of controllability is something that needs to be thought of from first principles when building these systems. The goal is not to build something really cute; it's to build something that actually functions for real-world use cases. We might not be there yet, but I'm sure we'll get there over time.
Yeah? What do you think about local models? Do you see a fit for them in the future? A hundred percent, yeah. At Cartesia, like I said, we build for any device, so we have models that run locally on edge devices. And I think it's actually really important, because cloud models will always be there.
There are certain capabilities you get at certain model sizes that you just can't fit on very small edge devices: not even laptops, but smaller phones and things like that.
But there are a lot of applications where edge devices are just critical. The question becomes: where is network latency plus cloud speed actually slower than the edge? And we've broken that. Running our models on the edge is about five times faster than if you were to do the round trip to the cloud.
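As a rough illustration of how that comparison works (the numbers below are made-up assumptions, not Cartesia's benchmarks):

```python
# Back-of-the-envelope comparison of cloud round trip vs. on-device inference.
# All numbers are illustrative assumptions.
network_rtt_ms = 120   # caller <-> cloud region; varies widely with geography
cloud_infer_ms = 40    # server-side model latency
edge_infer_ms = 35     # the same task on a local device

cloud_total_ms = network_rtt_ms + cloud_infer_ms   # the network is in the loop
edge_total_ms = edge_infer_ms                      # no network hop at all

print(f"cloud: {cloud_total_ms} ms, edge: {edge_total_ms} ms, "
      f"speedup: {cloud_total_ms / edge_total_ms:.1f}x")
```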
So you get unparalleled latency, and I think that's quite exciting. We're almost up on time, so maybe one more question. Sure. Yeah. So what tools do you recommend to monitor the agent? Because you're going to be the one we call.
Yeah. Something's broken: it said "pause three seconds" out loud, and it's not supposed to say "pause three seconds." Right. It could be the speech-to-text, it could be the LLM, it could be the text-to-speech, or it could be something wrong with AWS. How do you tell us, hey, it's not us,
it's not our regression? How do you surface that? Yeah, that's a good question. Evals are an incredibly important part of any system you build. How do we tell you that? I guess you can message us and ask if it's a problem on our side. But what we've found is that it's typically at the LLM stage that we see a lot of these issues.
Oftentimes there's some level of formatting you need to do on the outputs that come from speech-to-text to make them viable for your LLM, and on the outputs of your LLM to make them viable, or ergonomic, for your TTS to handle.
But we've actually handled a lot of those edge cases for you. Over time with Sonic 2, and I'm showing a slide here that mostly covers the latency side, a lot of the capabilities and quality have also improved substantially. Not only did we get our models to be about two and a half times as fast as our initial models, so we can serve at around 40 milliseconds of model latency, but a lot of those edge-case issues that LLMs often run into are things we know you don't want to deal with.
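As a concrete, purely hypothetical example of the kind of formatting glue being described between the LLM and the TTS stage (the function name and rules are illustrative, not Cartesia's API):

```python
import re

def normalize_for_tts(llm_text: str) -> str:
    """Hypothetical cleanup pass between LLM output and TTS input.

    Strips markup an LLM tends to emit and expands symbols a speech model
    may read out awkwardly; a TTS provider can absorb much of this glue
    for you, which is the point being made above.
    """
    text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", llm_text)  # keep link text, drop URL
    text = re.sub(r"[*_`#]+", "", text)                  # drop markdown markup
    text = text.replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_tts("**Sure!** Order [#123](https://example.com) ships in 3 days & arrives Friday."))
# -> "Sure! Order 123 ships in 3 days and arrives Friday."
```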
So we deal with them; we deal with them for you, making our system more robust. Got it. So one more from me. It's 2025 now; think five years into the future, 2030. If you look back at what you've done over those past five years, where is Cartesia at?
Where is voice AI at? What do you see? Yeah, I think voice AI will just be the de facto norm. In every industry you see, you're going to have some interaction with voice AI, whether it's on the triaging side, the full end-to-end support side, or gaming.
All those interactions will, in large part, be covered by voice AI. But I don't think that's the only thing. When you think about truly interactive models, it's more than just what you can hear; it's about how you experience the world around you.
A lot of people call these world models, but actually getting them to work in real time, I think, will be really exciting. I'm really looking forward to a world where we can have these kinds of systems work with us as assistants or copilots, or whatever, to help us understand the world in ways we couldn't have imagined originally.
Amazing. Well, thank you. I think that's all we had today; really appreciate it, and thank you for sharing. Yeah, appreciate it. Thanks, guys.