
Serving Voice AI at Scale — Arjun Desai (Cartesia) & Rohit Talluri (AWS)



00:00:00.000 | My name is Rohit Talluri. I'm a member of the Foundation Model Training and Inference Team
00:00:19.120 | at AWS, and we have an amazing talk from a super awesome founder here about voice AI.
00:00:24.900 | So, Arjun, why don't you go ahead?
00:00:26.500 | Awesome. Well, thanks for the intro, Rohit. I'm Arjun, one of the co-founders of Cartesia AI.
00:00:30.740 | I'll give you a quick spiel about what we do: we build real-time multimodal intelligence
00:00:36.660 | that runs on any device. The topic of this conversation will be focused a little more
00:00:43.080 | on how we build voice AI for the enterprise. As many of you know, voice AI is up and coming,
00:00:48.140 | and it's one of the places where having interactive models is critical to delivering
00:00:53.080 | a really good experience with your agents.
00:00:56.760 | So, just a little bit of a recap. When we think about foundation models, we often
00:01:02.860 | picture really large models hosted in the cloud, doing things
00:01:07.160 | in batch mode: you submit a request and then you get something back.
00:01:10.920 | You're not going to get it back 30 seconds later, but you might get it back
00:01:14.340 | in 500 or 600 milliseconds. And that's okay. For text, you're not
00:01:19.040 | going to be reading at speeds of 200 tokens per second anyway, so it's fine if
00:01:25.260 | there's a little bit of a delay up front and you get high throughput later.
00:01:27.680 | When you think about interactive applications, especially video and voice, you're giving your users
00:01:34.320 | an incredibly visceral experience. Speed is of the utmost importance, and quality
00:01:39.760 | is just table stakes. So at Cartesia, we're trying to change the paradigm of how we think about
00:01:45.600 | foundation models: not just models in the cloud built for batch operations, but models
00:01:51.120 | brought to real time, covering multiple modalities (we'll talk more about voice
00:01:56.540 | today), and running anywhere in the world, not just in the cloud but on any device.
00:02:00.920 | Because we're talking about voice, I want to give a little prelude on
00:02:05.880 | why it's so important to think about speed. Imagine you're having a conversation with the
00:02:10.440 | person next to you (and you probably should, after this talk). The main thing is that
00:02:15.600 | it would feel really awkward if you were talking to someone and they responded a second later.
00:02:19.580 | In voice, you don't have seconds to give your response back; you have milliseconds.
00:02:24.220 | And when you think about voice agents, where you might be calling customer support
00:02:28.320 | to triage a problem, you're going to get pretty annoyed if the agent isn't responding
00:02:32.440 | with accurate results as soon as possible. With voice, you also have to deal
00:02:36.920 | with things like interruptions, and you have to think about globalization: accents,
00:02:41.680 | the background noise you might be calling from. And at the end of the day, the experience
00:02:45.900 | you want to deliver is incredibly subjective, so there are many nuances and
00:02:50.260 | customizations that you want around voice.
00:02:51.920 | So what we've done at Cartesia is think about building solutions for voice AI on the
00:02:59.360 | modeling side from first principles. We care about three things. The first is quality: the naturalness
00:03:05.480 | of the voice must be exquisite. That's just table stakes for the kinds of experiences
00:03:10.140 | you want to deliver. The second is latency: you want to hear the first
00:03:15.240 | byte of audio on the other end of the line as soon as possible. This gives your end-to-end
00:03:21.380 | agent a lot more time to do things like reason more, and it gives a little more slack
00:03:26.140 | to the end-to-end system. And third, and arguably most important, is controllability. The experience
00:03:33.480 | you project through an agent customized to your needs is a reflection of
00:03:38.780 | your brand. You want the agent to talk about what your company is, or what you're trying to
00:03:44.360 | sell, in the way that you would. So being able to customize the voice AI to do this
00:03:49.500 | is paramount to delivering a great experience to your user.
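Latency in this setting is usually measured as time-to-first-audio-byte. As a rough illustration, here is a minimal sketch of measuring it against any streaming TTS endpoint; `stream_tts` is a hypothetical stand-in for a provider's streaming client, not a real API:

```python
# Minimal sketch: measuring time-to-first-audio-byte (TTFB) for streaming TTS.
import time
from typing import Iterator


def stream_tts(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming TTS client."""
    yield b"\x00" * 320  # pretend this is the first chunk of real audio


def time_to_first_byte(text: str) -> float:
    """Time from request to the first audio chunk a caller could play."""
    start = time.perf_counter()
    for _chunk in stream_tts(text):
        return time.perf_counter() - start
    raise RuntimeError("stream produced no audio")


print(f"TTFB: {time_to_first_byte('Hello there!') * 1000:.2f} ms")
```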
00:03:55.080 | So what we've done is pioneer a new architecture called state space models. These are an
00:04:00.120 | alternative to transformers, and the main takeaway here is that transformers typically scale
00:04:04.840 | quadratically: the longer your inputs get, the more quadratic growth you see in memory and in
00:04:11.780 | runtime. So the longer the inputs, the slower your models get. With state space models, or SSMs,
00:04:17.920 | generation at inference time is O(1): we maintain a state that you can generate from, and this means you get
00:04:25.460 | very low latency, something you can't achieve with traditional transformer architectures.
00:04:30.460 | What we've shown is that while state space models have historically been lower performing
00:04:38.040 | than transformers, we've closed that gap. Our models actually perform better not just from a latency perspective
00:04:44.180 | but also from a quality perspective.
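To make the contrast concrete, here is a toy linear state space recurrence in the spirit of SSMs; this is an illustrative sketch, not Cartesia's actual architecture. Each generation step updates a fixed-size state, so per-token cost stays constant no matter how long the history gets, whereas attention re-reads the entire history at every step:

```python
import numpy as np

# Toy linear state space recurrence (illustrative only, not Cartesia's model):
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
d_state, d_in, d_out = 16, 4, 4
rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((d_state, d_state))  # small weights keep h bounded
B = rng.standard_normal((d_state, d_in))
C = rng.standard_normal((d_out, d_state))

h = np.zeros(d_state)          # fixed-size state: the model's entire memory
for t in range(1000):          # 1,000 steps, constant work per step
    x_t = rng.standard_normal(d_in)
    h = A @ h + B @ x_t        # O(1)-in-sequence-length state update
    y_t = C @ h                # output for this step
```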
00:04:46.180 | So, I could give a longer spiel about why this matters, but I think we're more interested in having more of a
00:04:51.180 | conversation around voice AI. So, Ro, why don't you kick us off?
00:04:54.020 | Definitely. So, just so you all know, this was supposed to be a fireside chat, with chairs and me asking some good questions.
00:04:58.720 | We're going to open it up to the audience in a little bit, but I do have a few questions for you, Arjun.
00:05:02.440 | You talked a little bit about latency. We hear about voice AI challenges like quality, the speed of the
00:05:08.580 | models, the way they're hosted on the edge, et cetera. What are the challenges your customers are
00:05:13.580 | facing? Why are you building for this?
00:05:15.580 | Yeah. So I think the main thing is that our main model, Sonic 2, is
00:05:22.660 | focused on voice generation, and that's one part of the puzzle. If you're actually trying to build voice agents, you have to
00:05:28.800 | hook it up with an LLM and with your speech-to-text model. And the biggest issue, honestly, is that there's not
00:05:33.800 | enough time. Your LLM needs as much time as you can give it: these models aren't typically built for low-latency
00:05:41.320 | workflows, so you want to give them the most slack. Latency is of utmost importance there, and I can talk a
00:05:47.420 | little bit about what we've done to create the fastest text-to-speech model in the world. The second, I'd say, is
00:05:53.760 | controllability. One thing we've noticed is that people use our platform because you can get
00:05:59.260 | amazing quality around things like voice cloning, accents, and capturing background noise, natively in the
00:06:05.840 | generations you want. People find it a little uncanny valley when the agent they're talking to
00:06:10.680 | sounds too perfect. They like the little phone noises in the background, the little beep-boops. That's what you like to
00:06:15.820 | hear, or what you expect, when you're on a phone call. So those are the two main
00:06:19.860 | things I'd say we've done a really amazing job of nailing down.
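To ground the orchestration point, here is a skeletal sketch of the speech-to-text, LLM, and text-to-speech loop described above; all three stage functions are hypothetical placeholders rather than any vendor's real API. The structural point is that LLM tokens stream straight into TTS, so the faster the TTS starts producing audio, the more of the turn's latency budget is left for the LLM:

```python
import asyncio
from typing import AsyncIterator


async def stt_stream() -> AsyncIterator[str]:
    """Stand-in for a live speech-to-text stream of finalized utterances."""
    yield "what's my order status"


async def llm_stream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for an LLM streaming its reply token by token."""
    for token in ["Your ", "order ", "shipped ", "this ", "morning."]:
        await asyncio.sleep(0)  # pretend network delay
        yield token


async def tts_speak(chunks: AsyncIterator[str]) -> None:
    """Stand-in for streaming TTS: audio starts before the reply is complete."""
    async for chunk in chunks:
        print(f"[audio] {chunk}")


async def handle_turn() -> None:
    async for utterance in stt_stream():
        # LLM tokens stream straight into TTS; every millisecond shaved off
        # the TTS side leaves more of the turn's budget for the LLM to think.
        await tts_speak(llm_stream(utterance))


asyncio.run(handle_turn())
```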
00:06:23.480 | Yeah. And you mentioned something interesting there, a couple of use cases: voice cloning,
00:06:27.360 | controlling the voices, et cetera. Can you talk a little bit about the customer use cases you're seeing
00:06:31.740 | today in voice AI? Yeah. So voice AI, honestly, has penetrated so many markets out here.
00:06:36.740 | Look at healthcare and customer support. A lot of it is actually going into real-time gaming: when you're
00:06:42.740 | dealing with non-player characters, you want them to be dynamic and interact with the players of
00:06:47.960 | the game. Those are three of many, many markets that voice AI has started to grow
00:06:53.160 | in. And I think what's really exciting is that, because it's growing so fast, it's
00:06:57.720 | just so great to have partners like AWS who are also investing in the space and saying,
00:07:01.880 | look, these are things that we need to support natively and that people want to use our
00:07:05.740 | platform for. What about human narration? I think we hear a lot about voice AI taking
00:07:11.500 | over different types of industries, or the use cases you mentioned. Do you see a place for human
00:07:15.700 | narrators in the future as well? Definitely, yeah. Creators are a huge part of the voice AI
00:07:19.820 | platform. One thing I think we do a really good job of is that we actually have a voice
00:07:23.980 | marketplace for creators. Why is this so important? Our goal is not to
00:07:28.680 | replace voice actors. Actually, that's not what I want to do at all. My main goal is:
00:07:32.960 | how do you give them a platform so that their essence,
00:07:38.620 | who they are, their personality, can actually come through and be licensed
00:07:41.960 | by people who want to use the platform? We've had a lot of
00:07:46.180 | voice actors onboard onto our platform, and it's a great way of amplifying
00:07:49.240 | them. And yeah, a lot of use cases are honestly focused on narration as well. Got it. So I have a few
00:07:53.800 | more questions; maybe we'll end with them. Does anybody in the crowd have questions? Anything about voice AI?
00:08:00.520 | Question. I fear this might be a dumb question, but I'm going to ask it. No, please. I've just
00:08:05.240 | started using Cartesia. It's a game changer. Awesome.
00:08:09.220 | I'm using it as part of a Pipecat framework. Yeah, yeah. Because there's an Amazon partnership here, I was just
00:08:15.840 | wondering: Claude doesn't seem to work so well because of the latency. Okay. Is that normal?
00:08:21.640 | And it's one of the models that you'd usually pair with it, so are there plans for...
00:08:26.100 | Can you repeat the question on your end, just so that... Sure, sure.
00:08:28.980 | Yeah, yeah. So the question was:
00:08:31.960 | Cartesia is integrated with Pipecat, and Pipecat, Cartesia, and AWS all
00:08:36.360 | work together, but Claude still has pretty high latency when you're trying
00:08:40.400 | to do things end to end. This is a huge, huge problem, and it's
00:08:44.340 | why the slack we give you on the TTS side is something
00:08:47.620 | you can account for on the LLM side. I'll let Ro take a stab at that question,
00:08:51.880 | but I know that Claude and other models have
00:08:55.860 | mechanisms where you can have a dedicated instance
00:08:59.240 | running so you can get better latency numbers. But a lot of this
00:09:03.100 | is really up to the LLM providers to make
00:09:06.680 | these optimizations. One thing we're really excited about is that our goal
00:09:10.320 | is to make real-time AI pervasive. So if there are certain applications
00:09:14.580 | that you're really after, how can we enable you to do that?
00:09:17.420 | I'm happy to chat after this talk.
00:09:20.400 | Yeah, of course. And I'll add a couple of things to that. What's really amazing about Cartesia
00:09:25.400 | and their background is the development of custom model architectures specifically for this use
00:09:30.040 | case, voice AI. We spoke a little bit about SSMs and linear scaling versus quadratic
00:09:34.180 | scaling. One thing from the AWS side that I think is really amazing about our design philosophy
00:09:39.060 | for our generative AI ecosystem is that we want customers to have optionality in our platform. If you look
00:09:44.960 | at AWS's model gardens, SageMaker JumpStart and Amazon Bedrock, they host a
00:09:50.860 | range of different types of models for specific use cases. And we're on the lookout for the next
00:09:56.340 | foundation model provider, one like Cartesia, that we can bring into our ecosystem to unlock
00:10:01.220 | downstream industries that might be underserved by the existing foundation models today. So that's part of
00:10:07.260 | our strategy, part of our philosophy. And I'll say it again: with Cartesia,
00:10:11.620 | we're really unlocking voice AI. We need real-time AI, and we need the ability to host on edge
00:10:17.520 | devices. We're unlocking a lot of customers with their models. So hopefully that answers it a little bit.
00:10:23.000 | A little bit. A little bit. Yeah, we can chat a little bit more after. Yeah. Yeah.
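For reference, here is a minimal sketch of what invoking a Bedrock-hosted model looks like through boto3's Converse API; the model ID, region, and prompt are illustrative placeholders, and this is not taken from the talk:

```python
import boto3

# Hypothetical usage sketch: model ID, region, and prompt are placeholders.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Say hello in one sentence."}]}],
    inferenceConfig={"maxTokens": 64},
)
print(response["output"]["message"]["content"][0]["text"])
```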
00:10:26.900 | So I have some friends doing video AI research, and something they tell me is that in video AI, the prevailing theory is that you don't need so much scale in the actual video data, but rather density of information.
00:10:42.380 | I'm curious whether there's something similar in voice data: do you believe that to achieve next-level models you need the next scale of data, or is it more about the quality and density of that data?
00:10:57.760 | Yeah, that's a good question. So the question was along the lines of: how does voice AI data compare to some of the other multimodal datasets, like video, where one of the prevailing theories is that you don't actually need a lot of data, you just need it to be very rich,
00:11:13.760 | and whether that's the same in voice AI. I think the short answer is yes, but also no. When you think about scaling for a lot of models in generative AI, the general philosophy is: large-scale pre-training data, plus some kind of alignment data, or preference data, that you typically fine-tune on.
00:11:36.020 | I think that holds true for a lot of other modalities as well, including video. My background, I guess I didn't say, is that I did my PhD in generative AI at Stanford, where I also worked a lot on image and video models.
00:11:49.280 | But audio is actually quite interesting, because what people want from preference data is so diverse that you can't capture it purely with a one-stage fine-tuning step.
00:12:00.440 | So, yes, you definitely need very rich data, and you need high-quality data, of course, but you also need to pair it with information that many different kinds of people will want.
00:12:09.120 | This is how traditional LLMs are trained, and I think a lot of the large open-source image and video generation models are trained this way too.
00:12:17.340 | But honestly, it really depends on what you're after. Sorry, I know that's not a satisfying answer.
00:12:26.680 | What's your take on speech-to-speech models? I know Amazon has one. It seems immature, I would say.
00:12:32.380 | Yeah.
00:12:33.060 | But is that the future, or what's happening?
00:12:35.420 | Yeah, that's a great question. I think there are places where speech-to-speech is valuable right now,
00:12:42.720 | but I don't think it's at the point where we can actually use it for production or enterprise-grade use cases.
00:12:49.760 | And it's great: Amazon released a speech-to-speech model recently,
00:12:54.320 | and I think there are a few others out there as well.
00:12:56.900 | But I think it's still pretty clear that orchestrated solutions are where you get a lot of controllability over how you want the different pieces of your system to operate.
00:13:05.100 | From a latency perspective, I think speech-to-speech models will, of course, dominate over time.
00:13:11.760 | But that level of controllability is something that needs to be thought about from first principles when building these systems.
00:13:16.700 | The goal of this system is not to build something really cute;
00:13:18.940 | it's to build something that actually functions for real-world use cases.
00:13:21.440 | And we might not be there yet, but I'm sure we'll get there over time.
00:13:24.740 | Yeah?
00:13:27.540 | What do you think about local models? How do they fit in, in the future?
00:13:31.080 | A hundred percent, yeah.
00:13:32.080 | So at Cartesia, like I said, we build for any device,
00:13:34.820 | so we have models that run locally on edge devices.
00:13:38.780 | And I think it's actually really important, because cloud models will always be there:
00:13:43.980 | there are certain model sizes, and certain capabilities you get at those sizes, that you just can't fit on very small edge devices.
00:13:53.960 | Not even laptops, but smaller phones and things like that.
00:13:57.480 | But there are a lot of applications where edge devices are just critical.
00:14:02.980 | I think the question becomes: where is network latency plus cloud speed actually slower than edge?
00:14:09.540 | And we've broken that.
00:14:11.020 | Running our models on edge is about five times faster than if you were to round-trip.
00:14:16.740 | So the latency you get is just unparalleled,
00:14:21.420 | and I think that's quite exciting.
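The break-even argument works out as simple arithmetic. The numbers below are made up for illustration (the talk only claims roughly five times faster), but they show the shape of the comparison:

```python
# All numbers are assumed for illustration; only the ~5x ratio echoes the talk.
network_rtt_ms = 80      # assumed round trip from device to a cloud region
cloud_infer_ms = 40      # assumed server-side model latency
edge_infer_ms = 25       # assumed on-device model latency

cloud_total_ms = network_rtt_ms + cloud_infer_ms   # cloud pays the network leg
edge_total_ms = edge_infer_ms                      # edge pays inference only
print(f"cloud: {cloud_total_ms} ms, edge: {edge_total_ms} ms, "
      f"speedup: {cloud_total_ms / edge_total_ms:.1f}x")
```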
00:14:24.520 | So, we're almost out of time.
00:14:26.180 | We'll take maybe one more question.
00:14:27.840 | Sure.
00:14:28.020 | Yeah.
00:14:28.380 | Yeah.
00:14:29.700 | So what tools do you recommend to monitor the agent, basically?
00:14:33.980 | Because you're going to be the one that... we're going to call you.
00:14:36.580 | Yeah.
00:14:36.940 | Something's broken.
00:14:37.560 | It said "pause three seconds."
00:14:39.440 | Yeah.
00:14:39.740 | It's not supposed to say "pause three seconds."
00:14:40.880 | Right.
00:14:41.180 | So it could be the speech-to-text,
00:14:42.760 | it could be the LLM,
00:14:43.620 | it could be the text-to-speech,
00:14:44.880 | yeah,
00:14:45.120 | or, I don't know, something wrong with AWS.
00:14:47.100 | But how do you tell us, hey, it's not us,
00:14:50.240 | it's not our regression issue?
00:14:52.420 | How do you bring that in?
00:14:54.700 | Yeah.
00:14:55.340 | Yeah.
00:14:55.620 | That's a good question.
00:14:56.220 | Evals are an incredibly important part of any system that you build.
00:14:59.760 | How do we tell you that?
00:15:02.080 | I guess you can message us and ask us if it's a problem.
00:15:05.420 | But I think what we've found is that it's typically at the LLM stage that we see a lot of these issues.
00:15:10.140 | Oftentimes there's some level of formatting you need to do: on the outputs that come from speech-to-text, to make them viable for your LLM, and on the outputs of your LLM, to make them viable and ergonomic for your TTS to handle.
00:15:22.800 | But we've actually handled a lot of those edge cases for you.
00:15:25.740 | Over time with Sonic 2 (I'm showing a slide here that focuses mostly on the latency side), a lot of the capabilities and quality have also improved substantially.
00:15:33.800 | So not only did we get our models to be two and a half times as fast as our initial models, so they can be served at around 40 milliseconds of model latency, but a lot of those edge-case issues that LLMs often run into, we know you don't want to deal with them.
00:15:46.580 | So we deal with them.
00:15:47.640 | We deal with them for you, making our system more robust.
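As an illustration of the formatting glue being described, here is a hypothetical normalizer for LLM output headed to a TTS model; it is a sketch of the idea, not Cartesia's actual preprocessing:

```python
import re


def normalize_for_tts(llm_text: str) -> str:
    """Hypothetical cleanup of LLM output before text-to-speech."""
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", llm_text)  # keep link text, drop URL
    text = re.sub(r"[*_`#>]+", "", text)                      # strip markdown markup
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text


print(normalize_for_tts("**Your order** shipped [this morning](https://example.com)!"))
# -> Your order shipped this morning!
```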
00:15:50.460 | Got it.
00:15:51.160 | So one more from me.
00:15:52.260 | It's 2025 now.
00:15:54.220 | Think five years into the future, to 2030.
00:15:57.100 | If you look back at what you've done over those past five years, where is Cartesia at?
00:16:01.500 | Where is voice AI at?
00:16:02.460 | What do you see?
00:16:03.300 | Yeah, I think voice AI will just be the de facto norm.
00:16:08.640 | In every industry, you're going to have some interaction with voice AI, whether it's on the triaging side, the full end-to-end support side, or gaming.
00:16:18.520 | All those interactions will definitely be covered, in large part, by voice AI.
00:16:23.320 | But I don't think that's the only thing.
00:16:26.140 | When you think about true interactive models, it's more than just what you can hear;
00:16:31.900 | it's about how you experience the world around you.
00:16:34.400 | A lot of people call these world models,
00:16:36.940 | and I think actually getting them to work in real time will be really exciting.
00:16:40.560 | I'm really looking forward to a world where we can actually have these kinds of systems work with us as assistants or copilots, to help us understand the world in ways we couldn't originally imagine.
00:16:52.400 | Amazing.
00:16:53.060 | Yeah.
00:16:53.380 | Well, thank you.
00:16:54.160 | I think that's all we had today and really appreciate it.
00:16:56.600 | And thank you for sharing a little bit.
00:16:57.820 | Yeah, appreciate it.
00:16:58.780 | Thanks, guys.
00:16:59.100 | Thanks, guys.