Serving Voice AI at Scale — Arjun Desai (Cartesia) & Rohit Talluri (AWS)

00:00:00.000 |
My name is Rohit Talluri. I am a member of the Foundation Model Training and Inference Team 00:00:19.120 |
at AWS, and we have an amazing talk, super, super awesome founder here talking about voice AI. 00:00:26.500 |
Awesome. Well, thanks for the intro, Rohit. I'm Arjun. I'm one of the co-founders of Cartesia AI. 00:00:30.740 |
I'll give you a little bit of a spiel about what we do, which is that we build real-time multimodal intelligence 00:00:36.660 |
that runs on any device. So, maybe the topic of this conversation will be focused a little bit more 00:00:43.080 |
on how do we build voice AI for enterprise. As many of you know, voice AI is definitely up and coming 00:00:48.140 |
and one of the places where having interactive models is really critical for being able to get... 00:00:56.760 |
So, just a little bit of a recap, right? When you think about foundation models, you know, we often 00:01:02.860 |
think about, you know, models are hosted in the cloud, really large. You know, they're doing things 00:01:07.160 |
kind of in batch mode, right? You know, you submit a request and then you might get something back. 00:01:10.920 |
You know, you're not going to get it back 30 seconds later, but you might get it back, you know, 00:01:14.340 |
in 500, 600 milliseconds. And, you know, that's okay, right? Typically for text, you're not 00:01:19.040 |
going to be reading at speeds of, like, 200 tokens per second anyway. And so, it's okay if, you know, 00:01:25.260 |
there's a little bit of a delay and then you get high throughput later. 00:01:27.680 |
When you think about interactive applications like video and voice especially, this is an incredibly 00:01:34.320 |
visceral experience that you're giving to your users. Speed is of the utmost importance and quality 00:01:39.760 |
is just table stakes. So, at Cartesia, we're really trying to change the paradigm of how we think about 00:01:45.600 |
foundation models. So, not just, like, in the cloud, built for batch operations, but actually 00:01:51.120 |
how you bring it to real time, have it cover multiple modalities, and we'll be talking more about voice 00:01:56.540 |
today, and have it run anywhere in the world, not just on the cloud, but on any device. 00:02:00.920 |
So, because we're talking a little bit about voice, I want to maybe give a little bit of a prelude 00:02:05.880 |
why it's so important to think about speed. Imagine you're having a conversation with the 00:02:10.440 |
person next to you, and you probably should do that after this talk, but I think the main thing is that 00:02:15.600 |
it'd feel really awkward, right, if you were trying to talk to someone and they responded a second later. 00:02:19.580 |
So, in voice, you don't have seconds to actually give your response back, you have milliseconds. 00:02:24.220 |
And when you think about voice agents, where you might be calling, you know, customer support on the 00:02:28.320 |
line trying to triage your problem, you're going to get pretty annoyed if it's not responding to you 00:02:32.440 |
and giving you accurate results as soon as possible. So, when you think about voice, you have to deal 00:02:36.920 |
with things like interruptions, you have to think about globalization, things like accents, 00:02:41.680 |
right, background noises that you might be calling from, and at the end of the day, when you actually 00:02:45.900 |
want to deliver this experience, it's incredibly subjective. So, there's so many nuances and... 00:02:51.920 |
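A rough sketch of the latency math behind this point. Every number below is an illustrative assumption, not a figure from the talk; the takeaway is just that a natural-feeling turn leaves each component only a slice of a few hundred milliseconds:

```python
# Back-of-the-envelope budget for one conversational turn.
# All numbers are illustrative assumptions, not measured figures.
TURN_BUDGET_MS = 500  # much past this, the pause starts to feel awkward

stages_ms = {
    "network_round_trip": 80,   # caller <-> service
    "speech_to_text": 100,      # finalize the user's utterance
    "llm_first_token": 250,     # time to first token from the LLM
    "tts_first_audio": 50,      # time to first byte of audio
}

total = sum(stages_ms.values())
print(f"total: {total} ms of a {TURN_BUDGET_MS} ms budget")
for stage, ms in stages_ms.items():
    print(f"  {stage:>18}: {ms:4d} ms ({ms / TURN_BUDGET_MS:.0%})")
# Every millisecond saved on one stage is slack handed to the others.
```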
So, what we've done at Cartesia is really thought about building solutions for voice AI on the 00:02:59.360 |
modeling side from first principles. We care about three things. The first is quality. The naturalness 00:03:05.480 |
of the voice must be exquisite and something that is just table stakes for the kinds of experiences 00:03:10.140 |
that you want to be able to deliver. The second is latency. You want to be able to hear the first 00:03:15.240 |
byte or the first sound of audio on the other line as soon as possible. This gives your end-to-end 00:03:21.380 |
agent a lot more time to be able to do things like reason more, right, and have a little bit more slack 00:03:26.140 |
in the end-to-end system. And third, arguably most important, is controllability. The experience that 00:03:33.480 |
you want to exude by having an agent customized for the things that you want to do is a reflection of 00:03:38.780 |
your brand. You want the agent to be able to talk about what your company is or what you're trying to 00:03:44.360 |
sell in the way that you would. So, being able to customize the voice AI to be able to do this 00:03:49.500 |
is critical and paramount to being able to deliver a great experience to your user. 00:03:55.080 |
So, what we've done is we've pioneered a new architecture called state-space models. These are an 00:04:00.120 |
alternative to transformers, and the main takeaway here is that typically transformers scale 00:04:04.840 |
quadratically, which means the longer inputs get, you get quadratic scaling in your memory as well as in your 00:04:11.780 |
run time. So, you know, the longer inputs, the slower your models get. With state-space models, or SSMs, 00:04:17.920 |
generation at inference time is O(1). We maintain a state that you can generate from, and this means that you have 00:04:25.460 |
perfect, very low latency, things that you can't achieve with traditional transformer architectures. 00:04:30.460 |
What we've pioneered is that these state-space models typically have been, you know, lower performing in the recurrent sense 00:04:38.040 |
compared to transformers, but we've closed that gap. And our models actually perform not just better from a latency perspective, but on quality as well. 00:04:46.180 |
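A minimal sketch of the O(1) claim: an SSM decode step only touches a fixed-size state, while a transformer decode step attends over a cache that grows with the sequence. The matrices here are random stand-ins, not Cartesia's actual architecture:

```python
import numpy as np

d_state, d_model = 16, 8
A = np.random.randn(d_state, d_state) * 0.01  # state transition
B = np.random.randn(d_state, d_model)         # input projection
C = np.random.randn(d_model, d_state)         # output projection

def ssm_step(state, x):
    """One decode step: cost and memory are fixed no matter how long
    the sequence has been running."""
    state = A @ state + B @ x
    return state, C @ state

state = np.zeros(d_state)
for t in range(10_000):              # 10k steps, constant memory
    x = np.random.randn(d_model)
    state, y = ssm_step(state, x)

# A transformer decode step instead re-reads the KV cache for every
# previous token, so per-step cost grows with t (quadratic overall).
```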
So, I can give a little bit of a spiel about, you know, why this matters, but I think we're more interested in having more of a 00:04:51.180 |
conversation around voice AI. So, Ro, why don't you kick us off? 00:04:54.020 |
Definitely. So, just so you all know, this was supposed to be a fireside chat. We were going to have some chairs, and I'd ask some good questions. 00:04:58.720 |
So, we're gonna open it up to the audience in a little bit. I do have a few questions for you, Arjun. 00:05:02.440 |
You talked a little bit about latency. We hear voice AI challenges like quality and, let's say, the speed of the 00:05:08.580 |
models, as mentioned, the way that they're hosted on the edge, et cetera. What are the challenges that, like, your customers are facing? 00:05:15.580 |
Yeah. So, I think the main thing is that, you know, when you do voice AI and, you know, the main model that we have is Sonic 2, which is 00:05:22.660 |
focused on voice generation. That's one part of the puzzle. If you're actually trying to build voice agents, you have to 00:05:28.800 |
hook it up with the LLM, you have to hook it up with your speech-to-text model. And the biggest issue, honestly, is that there's not 00:05:33.800 |
enough time. You need your LLM to actually have a lot of time, you know. These models aren't typically built for, you know, low- 00:05:41.320 |
latency workflows, and so you want to actually give it the most slack. So, latency is, you know, of utmost importance there. And I can talk a 00:05:47.420 |
little bit about, you know, what we've done to create the fastest model in the world for text-to-speech. The second, I'd say, is 00:05:53.760 |
controllability. So, you know, one thing that we've noticed so much is people use our platform because you're able to get 00:05:59.260 |
amazing quality around things like voice cloning, accents, being able to capture background noises, just so natively in the 00:06:05.840 |
generations that you want. You know, people think it's, like, a little bit of an uncanny valley when you're talking to an agent and it just 00:06:10.680 |
sounds perfect. They like the little phone noises in the background, little beep boops, right? You know, that's kind of what you like to 00:06:15.820 |
hear, right? Or what you expect, right? When you're on a phone call. So, I think those are the two main 00:06:19.860 |
things that I'd like to say that, you know, we've done a really amazing job of nailing those 00:06:23.480 |
down. Yeah. And you mentioned something interesting there, a couple of the use cases, right? Voice cloning, 00:06:27.360 |
the control of the voices, et cetera. Can you talk a little bit about the customer use cases you're seeing 00:06:31.740 |
today in voice AI? Yeah. So, you know, voice AI, honestly, has penetrated so many markets out here. 00:06:36.740 |
We look at healthcare, customer support. A lot of this is actually going into real-time gaming, right? When you're 00:06:42.740 |
dealing with non-player characters, you want them to be dynamic and interact, you know, with the players of 00:06:47.960 |
the game. These are, you know, three of many, many markets that, you know, Voice AI has started to grow 00:06:53.160 |
in. And, you know, I think what's really exciting is that because it's growing so fast, you know, it's 00:06:57.720 |
just so great to have partners like AWS who are also, you know, investing in that space and being like, 00:07:01.880 |
look, you know, these are things that we need to support naturally and, you know, people want to use our 00:07:05.740 |
platform for. What about human narration, right? I think we hear a lot about, like, Voice AI taking 00:07:11.500 |
over different types of industries or these use cases as mentioned. Do you see a place for human 00:07:15.700 |
narrators in the future as well? Definitely, yeah. You know, creators are a huge part of the Voice AI 00:07:19.820 |
platform. I think one thing that we do a really good job of is we actually have, like, a voice 00:07:23.980 |
marketplace for creators, right? You know, why is this so important? You know, our goal is not to, you know, 00:07:28.680 |
replace voice actors. Actually, it's not what I want to do at all. I think my main goal is, like, 00:07:32.960 |
how do you give them a platform so that, like, their essence, 00:07:38.620 |
right? You know, who they are, their personality can actually be exuded and, like, you know, licensed 00:07:41.960 |
by other people that want to use the platform. And so, you know, this actually gives them, you know, 00:07:46.180 |
a great way of amplifying themselves. We've had a lot of voice actors onboard onto our platform. 00:07:49.240 |
And yeah, a lot of use cases are honestly focused on narration as well. Got it. So I have a few 00:07:53.800 |
more questions. Maybe we'll end with them. Does anybody have questions in the crowd? Anything about Voice AI? 00:08:00.520 |
Question. I fear this might be a dumb question, but I'm going to ask you. No, please. I've just 00:08:05.240 |
started using Cartesia. It's, like, it's a game changer. Awesome. 00:08:09.220 |
Part of a podcast framework. Yeah, yeah. Because there's an Amazon partnership here, I was just 00:08:15.840 |
wondering, like, Claude doesn't seem to work so well because of the latency. Okay. Is that, is that normal? 00:08:21.640 |
And then one of the models that you usually pair with, and is there, is there plans for... 00:08:26.100 |
Can you repeat the question on your end, just so that... Sure, sure. 00:08:28.980 |
...it's on the recording? Yeah, yeah. So the question was, you know, 00:08:31.960 |
where, you know, Cartesia is integrated with PipeCat. You know, PipeCat, Cartesia, we both, you know, 00:08:36.360 |
work with AWS. But Claude, you know, still has, you know, pretty high latency when you're trying 00:08:40.400 |
to, you know, do things end to end. So, you know, this is, you know, a huge, huge problem, right? 00:08:44.340 |
This is why, you know, the slack that we give you on the TTS side is something, you know, 00:08:47.620 |
you can account for on the LLM side. I'll let Ro, you know, take a stab at that question. 00:08:51.880 |
But I know that, like, you know, Claude and, you know, other models, like, you know, have 00:08:55.860 |
different, you know, mechanisms where, you know, you can have, like, a dedicated instance that's 00:08:59.240 |
running so you can get, like, you know, better latency numbers. You know, but a lot of this 00:09:03.100 |
is actually just, you know, it's up to, you know, the LLM providers often, right, to make 00:09:06.680 |
these optimizations. I think one thing that, you know, we're really excited about is our goal 00:09:10.320 |
is to, you know, make real-time AI pervasive, right? And so, you know, we're really excited 00:09:14.580 |
about if there's, like, certain applications that, you know, you're really after, you know, 00:09:17.420 |
how can we enable you to do that? So I'm happy to chat after this talk. 00:09:20.400 |
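A sketch of the orchestration pattern being described here: stream the LLM and hand each finished sentence to TTS immediately, so the first audio plays while the LLM is still generating. `llm_stream` and `tts_speak` are hypothetical placeholders, not the actual Cartesia, PipeCat, or Bedrock APIs:

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def speak_streaming(llm_stream, tts_speak):
    """Forward finished sentences to TTS while the LLM is still decoding."""
    buffer = ""
    for token in llm_stream:        # tokens arrive incrementally
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]: # everything but the trailing fragment
            tts_speak(sentence)     # first audio starts long before the
        buffer = parts[-1]          # LLM finishes its full answer
    if buffer.strip():
        tts_speak(buffer)

# Example with stand-ins for the real streams:
speak_streaming(iter(["Hi", " there.", " How can", " I help?"]), print)
```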
Yeah, of course. And I'll add a couple things to that, too. What's really amazing about Cartesia 00:09:25.400 |
and their background is the development of custom model architectures specifically for this use 00:09:30.040 |
case, which is voice AI. And we spoke a little bit about SSMs and linear scaling versus quadratic 00:09:34.180 |
scaling. One thing from the AWS side and what I think is really amazing about our design philosophy 00:09:39.060 |
for our generative AI ecosystem is we want customers to have optionality in our platform. And if you look 00:09:44.960 |
at AWS's model gardens, we have SageMaker JumpStart and Amazon Bedrock, these model gardens will have a 00:09:50.860 |
host of different types of models for specific types of use cases. And we're on the lookout for the next 00:09:56.340 |
foundation model provider, one like Cartesia, that we can bring into our ecosystem and unlock 00:10:01.220 |
downstream industries that might be underserved by the existing foundation models today. So that's part of 00:10:07.260 |
our strategy, part of our philosophy. And I'll say it again, with Cartesia, 00:10:11.620 |
we're really unlocking voice AI. We need real-time AI. We need the ability to host on edge 00:10:17.520 |
devices. And we're unlocking a lot of customers with their models. So hopefully that answers a little bit. 00:10:23.000 |
A little bit. A little bit. Yeah, we can chat a little bit more after. Yeah. Yeah. 00:10:26.900 |
So I have some friends that are doing video AI research, and something they're kind of telling me is that in video AI, the prevailing theory is that, you know, you don't need so much scale of the actual video data, but rather density of information. 00:10:42.380 |
I'm curious if there's something similar in voice data, whether you believe that, you know, to achieve, you know, next-level models, you need, like, the next scale of data, or is it more about, like, the quality and density of that data? 00:10:57.760 |
Yeah, that's a good question. So I think the question was more along the lines of, like, you know, how does voice AI data, you know, compare to, you know, some of the other multimodal data sets in video. You know, one of the prevailing theories was that, you know, you don't actually need a lot of data, you just need it to be very rich, right? 00:11:13.760 |
And, you know, if that's the same in voice AI. I think the short answer to that is yes, but also no. I think, like, when you think about scaling for a lot of models in generative AI, right, the general philosophy is, like, oh, you know, large-scale pre-training data, and then you have some kind of, like, alignment data or, like, preference data that, you know, you fine-tune on typically. 00:11:36.020 |
I think that holds true for a lot of other modalities as well. It actually holds true for video as well. So my background, I guess I didn't say, is that, you know, I did my PhD in generative AI at Stanford, and was working a lot on, like, image and video models there as well. 00:11:49.280 |
But in audio, it's actually quite interesting because, you know, what people want from preference data is actually so diverse that it's, like, you know, you can't capture it purely by, like, a one-stage fine-tuning step. 00:12:00.440 |
So, yeah, you definitely need very rich data. You know, you need high-quality data, of course, but you also need to pair it with information that, like, you know, many different kind of people will want. 00:12:09.120 |
So, you know, this is how traditional LLMs are trained. I think a lot of the, like, you know, large open-source, you know, image generation, video generation models also are trained this way. 00:12:17.340 |
But honestly, it just really depends on what you're trying to get after. But, yeah. Sorry, I don't know. I know that's not a satisfying answer, but, yeah. 00:12:26.680 |
What's your take on the speech-to-speech model? I know Amazon has one. It seems immature, I would say. 00:12:33.060 |
But is that the future, or, like, what's happening? 00:12:35.420 |
Yeah, that's a great question. You know, I think there are places where, like, speech-to-speech is valuable right now. 00:12:42.720 |
But I don't think it's at the point where we can actually use it for, you know, production or enterprise-grade use cases. 00:12:49.760 |
And, you know, it's great. You know, I think, you know, Amazon released the speech-to-speech model recently. 00:12:54.320 |
You know, I think there's a few others that are out there as well. 00:12:56.900 |
But I think it's still pretty clear that, like, orchestrated solutions are places that you get a lot of controllability around, like, how you want the different pieces of your system to operate. 00:13:05.100 |
And, you know, from a latency perspective, I think speech-to-speech models will, of course, you know, dominate over time. 00:13:11.760 |
But I think that level of controllability is something that needs to be thought of from first principles when building these systems, right? 00:13:16.700 |
The goal of this system is not to build something really cute. 00:13:18.940 |
It's to build something that actually functions for real-world use cases. 00:13:21.440 |
And we might not be there yet, but I'm sure, you know, we'll get there over time. 00:13:27.540 |
What do you think about local models? Do you see a fit for them in the future? 00:13:32.080 |
So at Cartesia, like I said, you know, we build for any device. 00:13:34.820 |
So we have models that run locally on edge devices. 00:13:38.780 |
And, you know, I think it's actually really important because, you know, I think cloud models will always be there, right? 00:13:43.980 |
You know, there's certain size models, certain capabilities, like, you know, that you get at, like, certain sizes, that you just can't, you know, fit on, like, I don't know, very, very small edge devices, right? 00:13:53.960 |
Like, not even, like, laptops, but, like, if you get to, like, smaller phones and things like that. 00:13:57.480 |
But there's a lot of applications where, you know, edge devices are just critical, right? 00:14:02.980 |
I think the question becomes, like, where is network latency plus cloud speed actually slower than edge? 00:14:11.020 |
So running our models on the edge is about five times faster than if you were to do the round trip. 00:14:16.740 |
And so you get, like, just unparalleled latency. 00:14:29.700 |
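A toy version of that comparison. The numbers are assumptions chosen to illustrate the point, not measurements: a cloud model can be fast in isolation and still lose once the network round trip is added.

```python
# Illustrative numbers only; not benchmarks from the talk.
cloud_inference_ms = 40    # fast model in a well-provisioned region
network_rtt_ms = 120       # e.g., a cellular round trip to that region
edge_inference_ms = 35     # smaller model running on the device itself

cloud_total = cloud_inference_ms + network_rtt_ms
print(f"cloud: {cloud_total} ms   edge: {edge_inference_ms} ms")
print(f"edge is ~{cloud_total / edge_inference_ms:.1f}x faster end to end")
```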
So what tools do you recommend to monitor, basically, the agent, right? 00:14:33.980 |
Because you're going to be the one that we're going to call. 00:14:39.740 |
It's not supposed to pause three seconds. 00:14:45.120 |
It could be, I don't know, something wrong with AWS. 00:14:47.100 |
But, like, how do you tell us, like, hey, it's not us? 00:14:56.220 |
Evals are an incredibly important part of any system that you build. 00:15:02.080 |
I guess you can message us and ask us if it's a problem. 00:15:05.420 |
But I think what we've found is, you know, typically it's at the LLM stage that we see a lot of these issues. 00:15:10.140 |
You know, oftentimes there's, like, some level of formatting that you need to do on outputs that come from speech-to-text to make them, like, viable for your LLM, and on the outputs of your LLM to make them, like, you know, viable or ergonomic for your TTS to handle. 00:15:22.800 |
But we've actually handled a lot of those edge cases for you. 00:15:25.740 |
So over time, you know, with Sonic 2, you know, I'm showing a slide here that's, like, you know, mostly showing the latency side, but a lot of the capabilities and quality have also improved substantially. 00:15:33.800 |
So not only did we get our models to be, you know, two and a half times as fast as our initial models, so we can serve at, like, 40-millisecond model latency, but a lot of those, you know, edge case issues that LLMs often run into, we know that you don't want to deal with them. 00:15:47.640 |
We deal with them for you, making our system more robust. 00:15:57.100 |
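One way to get the attribution the question asks about is to time each stage of the pipeline per turn, so a three-second pause can be pinned on STT, the LLM, or TTS specifically. A minimal sketch; the stage functions are hypothetical placeholders:

```python
import time
from contextlib import contextmanager

timings = {}  # per-turn stage latencies, in milliseconds

@contextmanager
def timed(stage):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

def handle_turn(audio, stt, llm, tts):
    """Run one turn, recording where the time actually went."""
    with timed("stt"):
        text = stt(audio)
    with timed("llm"):
        reply = llm(text)
    with timed("tts"):
        speech = tts(reply)
    # Log per-stage latency with every turn; alert on the slowest
    # stage, not just the end-to-end total.
    print({k: f"{v:.0f} ms" for k, v in timings.items()})
    return speech
```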
Five years from now, if you look back at what you've done over those past five years, where is Cartesia at? 00:16:03.300 |
Yeah, you know, I think voice AI will be just, like, the de facto norm, right? 00:16:08.640 |
I think in every industry that you see, you're going to have some interaction with voice AI, whether it's on the triaging side, the full end-to-end, you know, support side, you know, gaming, right? 00:16:18.520 |
All those interactions will definitely be, you know, a large part, you know, covered by voice AI. 00:16:23.320 |
But I don't think this is the only thing, right? 00:16:26.140 |
When you think about true interactive models, it's more than just what you can hear, right? 00:16:31.900 |
It's about how you experience the world around you, right? 00:16:34.400 |
So, you know, a lot of these people call it, like, world models, right? 00:16:36.940 |
But I think actually getting these to work in real time, I think, will be really exciting. 00:16:40.560 |
And I'm really looking forward to, like, a world where, you know, we can actually have these kinds of systems work with us as, you know, assistants or co-pilots or, you know, whatever, to actually help us understand the world in ways that we couldn't imagine originally. 00:16:54.160 |
I think that's all we had today and really appreciate it.