Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily


Transcript

Hi, everybody. My name is Kwin. I am a co-founder of a company called Daily. Daily's other founder, Nina, is in the back there. I'm stepping in for my colleague, Mark, who couldn't make it today. So we're going to do this fast and very informally. But I think that's a good way to do it at an engineering conference.

I don't have as much code to show as the last awesome presentation, but I'll try to show a little bit. We're going to talk about building voice agents today. I work on an open-source, vendor-neutral project called PipeCat, and a lot of other people at Daily do, too, because voice AI is growing fast and is super interesting and is a good fit for what we do as a company.

We started in 2016. We are global infrastructure for real-time audio, video, and now AI for developers. PipeCat sits somewhere higher up in the stack than our traditional infrastructure business. So we'll talk a little bit about how you can build reliable, performant voice AI agents completely using open-source software. We also recently launched a layer just on top of our traditional infrastructure designed for hosting voice AI agents.

We'll talk just a little bit about that. So we've been doing this a long time. We care a lot about the developer experience for very fast, very responsive, real-time audio and video. We have a long list of engineering firsts we're proud of, but that's not why you're here today.

Happy to talk about that later, though. If we step back and orient a little bit, what are you doing when you build a voice agent? I tend to sort of orient people with three things they have to think about. You've got to write the code. You have to deploy that code somewhere.

And then you have to connect users over the network or over a telephony connection to that agent. A few things here. User expectations are high. Voice AI is new, but it's growing fast, I think, because with the best technologies that are just now becoming available we're able to meet user expectations. And users expect the AI to understand what they're saying, to feel smart and conversational and human, to be connected to knowledge bases, and to have actual access to useful information for whatever it is doing for that user.

And to sound natural: there's definitely an uncanny valley problem that generative AI fell into for a very long time. Now we're on the other side of that for voice AI, which is really exciting. The agents also have to respond fast. It varies by language and by culture and by individual, but roughly speaking, humans expect a 500-millisecond response time in natural human conversation.

If you don't do that in your voice AI interface, you are probably going to lose most of your normal users. So we tell people to target 800-millisecond voice-to-voice response times. That's not easy to do with today's technology, but it is definitely possible. And build your UI very thoughtfully around the fact that humans expect fast responses.

The other thing that's hard, a little bit like fast response times, is knowing when to respond. Humans are good but not perfect at knowing when somebody we're talking to is done talking, and when we should start talking. Voice AI agents are not as good at that yet, but they're getting better, so we'll talk a tiny bit about that.

So why do developers use a framework like pipecat instead of writing all the code themselves? Well, a little bit of it is all those hard things on the previous slide that you probably don't want to write the code for yourself if you're mostly thinking about your business logic and your user experience and connecting to all of your systems.

We want to use battle-tested implementations of things like turn detection, interruption handling, context management, calling out to other tools, and function calling in an asynchronous environment, all that stuff. So developers tend to use frameworks these days for lots of the agentic things they do. And in voice AI, I think it's even more important to sit on top of really well-tested infrastructure and code components than in other domains.

Pipecat appeals to developers because it's 100% open source and completely vendor-neutral. You can use it with lots of different providers at every single layer of the stack that pipecat enables. For example, there's native telephony support in pipecat, so you can use pipecat with lots of different telephony providers in a plug-and-play manner.

You can use Twilio, for example, which a lot of developers know. If you're in a geography like India where Twilio doesn't have phone numbers, you can use Plivo. A bunch of other telephony providers are supported. There's a native audio smart turn model that's completely open source in pipecat. So the community has gotten large enough that there's kind of cutting-edge ML research, at least in the small-model domain, coming out of this open source community, which is really fun.

Pipecat Cloud, I think, is a really nice advantage for the pipecat ecosystem. It's the first open source voice AI cloud, built from the ground up to host code that you write and designed for the problems of voice AI. And pipecat supports a lot of models and services, something like 60-plus.

All the things you would want to use in a voice AI agent are probably in the pipecat main branch. So you probably don't have to write code to get started, though the appeal is that you can write lots and lots of code if you want to, so there's no ceiling.

I'll talk a little bit about what the architecture looks like, and we probably won't have time to talk about client SDKs because most of you in this room are probably building for telephony use cases. But there's a really rich and growing set of JavaScript, React, iOS, and Android client-side components and SDKs that people in the pipecat community are using to build multimodal applications that run on the web and on native mobile platforms.

So we talked about this, so I will actually just skip this slide. I hope we'll have time for Q&A. That's the most fun part. Here's the other piece that often helps orient people. This is what a pipecat agent looks like. So you're building a pipeline of programmable media handling elements.

These are all written in Python, although lots of the performance-sensitive ones bottom out in some kind of C code. That's pretty common in real-time media handling. You probably don't have to worry about that level, though. You're probably just thinking in Python. Pipecat pipelines can be really simple.

They can have just a few elements, maybe just three: something receiving from the network, something doing some processing, and something sending stuff back out to the network. Or they can be quite complicated. And we see enterprise voice agents often become quite complicated because they're doing complicated things and connecting out to a large variety of existing legacy systems.
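To make that shape concrete, here is a minimal sketch of a three-ish element pipeline in Python. The service classes, module paths, and constructor arguments are assumptions based on recent Pipecat releases and particular providers; treat it as an illustration of the pipeline shape rather than a drop-in example.

```python
# Minimal Pipecat-style pipeline sketch: transport in -> STT -> LLM -> TTS -> transport out.
# Module paths, class names, and constructor arguments are assumptions based on recent
# Pipecat releases and specific providers; check the docs for your version.
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    # Network element: terminates the WebRTC (or telephony) connection.
    transport = DailyTransport(
        os.environ["DAILY_ROOM_URL"],  # room URL
        None,                          # meeting token, if the room requires one
        "my-voice-agent",              # bot name
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Processing elements: speech-to-text, text-mode LLM, text-to-speech.
    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
    tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"], voice_id="<voice-id>")

    # A pipeline is just an ordered list of frame processors.
    pipeline = Pipeline([
        transport.input(),   # audio frames coming in from the user
        stt,                 # audio -> transcription frames
        llm,                 # transcription -> LLM response frames
        tts,                 # LLM text -> audio frames
        transport.output(),  # audio frames going back out to the user
    ])

    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```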

So, as an example of a bit of that span: the two screenshots on the left here are from the pipecat docs, about how you work with the OpenAI audio-centric models in pipecat. OpenAI gives you a couple of different shapes of models and APIs that you can use. One is chaining together transcription, a large language model operating in text mode, and voice output.

The other is using their new and, in some ways, experimental speech-to-speech models, which are also really awesome and promising. You can do either of those approaches in pipecat by changing probably three or four lines of code.
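As a rough illustration of how small that change is, continuing the sketch above, the swap amounts to replacing the STT + text-mode LLM + TTS trio with a single speech-to-speech service in the pipeline list. The class and module names below are assumptions based on recent Pipecat releases:

```python
# Chained approach: three services in the pipeline, LLM operating in text mode.
pipeline = Pipeline([
    transport.input(),
    stt,   # transcription
    llm,   # e.g. OpenAILLMService in text mode
    tts,   # voice output
    transport.output(),
])

# Speech-to-speech approach: one audio-native service replaces the trio.
# Class and module name are assumptions; check your Pipecat version's docs.
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

s2s = OpenAIRealtimeBetaLLMService(api_key=os.environ["OPENAI_API_KEY"])
pipeline = Pipeline([
    transport.input(),
    s2s,   # audio in, audio out, one inference service
    transport.output(),
])
```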

On the right is the core chunk, a few hundred lines of Python, and a flow diagram for a more complicated pipeline. This is one of my favorite starter kit examples for pipecat. It uses two instances of the Gemini Multimodal Live API in audio-native mode. One is the conversational flow, and the other is another participant in the conversation that plays a game with the user. So there's sort of an LLM-as-a-judge pattern here, but in the context of a game.

And you're moving the audio frames around through both pipelines selectively, depending on the results of the real-time inference, which is a pattern we also see in enterprise use cases, but it's fun to clone this and run it and play the game. We listed some of the services here. We can talk a lot more if you want to in the Q&A about sort of what we see people actually using in production most often in terms of models and services in enterprise voice AI.
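To make the frame-routing idea a little more concrete, here is a minimal sketch of a custom frame processor that forwards user audio only while a flag is open; something along these lines, driven by the game logic, is what decides which pipeline hears the user. The import paths and base-class API are assumptions based on recent Pipecat releases.

```python
# A gate that selectively forwards user audio frames downstream.
# Import paths are assumptions based on recent Pipecat releases.
from pipecat.frames.frames import Frame, InputAudioRawFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AudioGate(FrameProcessor):
    """Drops incoming user audio unless the gate is open; everything else passes through."""

    def __init__(self):
        super().__init__()
        self.audio_enabled = True

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, InputAudioRawFrame) and not self.audio_enabled:
            return  # swallow user audio while this branch is muted
        await self.push_frame(frame, direction)
```

An instance of this placed in front of each LLM branch, with `audio_enabled` flipped by the results of the real-time inference, is one way to get the selective routing described above.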

So that's a very quick rundown of the pipecat framework, which is how you write the code. Now, how do you deploy it, and why am I talking to you about pipecat cloud today? There are a bunch of hard things about voice AI that are unique to these use cases.

These are long-running sessions. They have to use network protocols that are designed for low latency. Things like auto-scaling are not available out of the box for these workloads the way they are for something like HTTP workloads. So I was actually quite resistant for a long time to building anything commercial around pipecat at Daily, because we do the low-level infrastructure.

We already have things that we do that serve the pipecat community. But it got to the point where a very large percentage of the questions in the pipecat Discord were about how to deploy and scale. And I initially sort of felt like that was a solved enough problem, because what we do at the infrastructure level helps you in one way.

The platforms that a lot of our friends and customers build much higher up in the stack, which wrap the whole voice AI problem set in very easy-to-use dashboards and tools and GUIs, are also really good solutions. But what we came to realize is that there was a middle of the stack that people were asking about a lot in the open-source community, and it boiled down to, "How do I do my Kubernetes?" People would ask questions in the pipecat Discord about deployment and scaling, and we would say, "Oh, well, if you really want to run this stuff yourself on your own infrastructure, here are the five things you do in Kubernetes," and people would say some version of "Kuber-what?"

We didn't have a good answer to that, so we thought we'd come up with one: a very thin layer on top of our existing global, media-oriented, real-time infrastructure. This is not a very good marketing tagline, but I think of it as a very thin wrapper around Docker and Kubernetes, optimized for voice AI.

So what are the things we're trying to solve for? Fast start times are very important. If somebody calls your voice agent and they hear ringing, they want to hear that voice agent pick up the phone and say, "Hello," pretty fast. Almost no matter what you do in AI, you care about cold start times, but it's even more important when the user is initiating some action and expects to hear audio back.

Cold starts are hard. If you've built an AI infrastructure, you know that. We try to solve the cold start problem for voice AI. Happy to talk about cold starts in great detail, because it's something I've been thinking a lot about over the last few months. Autoscaling is a little bit related to cold starts.

You want your resources to expand as your traffic pattern expands. The alternative is you know exactly what your traffic pattern is, and you just deploy a bunch of resources. That doesn't work for most workloads. Most people have time dependent or completely unpredictable workloads, so you need to scale up and scale down.

Real-time is different from non-real-time, and by non-real-time, I mean everything that's not conversational latency of a few hundred milliseconds or less. If you are making an HTTP request, you want it to be fast, but you don't really care if your P95 is fifteen hundred milliseconds or two thousand milliseconds.

In most cases, in a voice AI conversation, you care a lot if your P95 goes up above eight hundred, nine hundred, a thousand milliseconds for the entire voice-to-voice response chain. All the little inference calls you make as part of that have to be much faster than that by definition.
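One way to internalize that: write the voice-to-voice budget down as a sum. The numbers below are illustrative placeholders for a chained pipeline, not measurements, but they show how quickly 800 milliseconds gets used up.

```python
# Illustrative voice-to-voice latency budget (milliseconds) for a chained
# STT -> LLM -> TTS agent. All numbers are made-up placeholders, not benchmarks.
budget_ms = {
    "network, user to agent (one way)": 40,
    "turn detection / endpointing": 200,
    "speech-to-text finalization": 100,
    "LLM time to first token": 250,
    "text-to-speech time to first byte": 120,
    "network, agent to user (one way)": 40,
}

for stage, ms in budget_ms.items():
    print(f"{stage:35s} {ms:5d} ms")
print(f"{'voice-to-voice total':35s} {sum(budget_ms.values()):5d} ms")  # 750 ms
```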

So the whole networking stack from client to wherever your pipecat code is running and inside that Kubernetes cluster has to be optimized for real-time. You probably need global deployment. You probably have GDPR or data residency or other kinds of data privacy requirements. Or you just need global deployment because you want these servers close to users because that helps with latency.

And all these things have to be delivered at reasonable cost. So we try to take these things off your plate and help you build quickly and get to market with your voice agents. A couple of other things are just worth flagging here. We've done a lot of work on turn detection, which is one of the top three problems most people in voice AI are thinking about how to make better in 2025.

Check out the open source smart turn model that's part of the pipecat ecosystem if you're interested in that. The open source smart turn model is built into pipecat cloud and runs for free. Our friends at FAL host it. You've probably heard of FAL if you're doing Gen AI stuff.

Very fast, very good, GPU-optimized inference. The other thing is ambient noise and background voices. One problem with voice AI is that even though transcription models today are very resilient to all kinds of noisy environments, the LLMs themselves are not. So if you are trying to do transcription, figure out when people are talking, figure out when to fire inference down the chain, and ask your LLMs to do something, background noise that sounds a little bit like speech will trigger lots of interruptions that you don't mean to happen and will inject lots of spurious pseudo-speech into your transcripts.

And that's true even for speech-to-speech models today. They're not very resilient to background noise. The best solution to background noise today is a commercial model from a really great small company called Krisp. The Krisp model is only available with a fairly big chunk of commercial licensing, but you can use Krisp for free inside pipecat cloud if you run on pipecat cloud. You can also use Krisp in your own pipecat pipelines with your own license if you run Krisp somewhere else.
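For the self-hosted case, the usual pattern is to attach the noise filter to the transport's audio input so that everything downstream (VAD, turn detection, transcription) sees cleaned-up audio. The module path, class name, and parameter name below are assumptions based on recent Pipecat releases, and the Krisp SDK and model file are licensed and configured separately.

```python
# Sketch: attaching a Krisp noise filter to the transport's audio input.
# The module path, class name, and audio_in_filter parameter are assumptions based
# on recent Pipecat releases; the Krisp SDK and model file are configured separately.
import os

from pipecat.audio.filters.krisp_filter import KrispFilter
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    os.environ["DAILY_ROOM_URL"],
    None,
    "my-voice-agent",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        audio_in_filter=KrispFilter(),  # denoise before VAD, turn detection, and STT
    ),
)
```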

Finally, agents are non-deterministic. As we all know, there's a whole evals and PM track here, and in every other track we talk about this problem. We've got some nice low-level building blocks for logging and observability natively in pipecat, exposed through pipecat cloud and through a bunch of partners we work with on that.

I'm happy to introduce you to the great teams we work with at various companies that are building observability stuff. That is my speed run. I came in 20 seconds under the 15 minutes, but because we are the last talk in this block, if people want to do Q&A, totally happy to.

Thanks, Ryan. One -- actually, I have two questions, very quick ones. One is, we're based in Sydney, Australia. One of the problems we've run into with the 800-millisecond thing is the time to go and call OpenAI and come back to Australia. So the OpenAI round trip is slow. Do you have any alternatives for that?

Have you looked at other alternatives for people outside the states? Yes. That's a great question. So the question -- I will repeat the question. The question is, if you're in a geography that is a long way from your inference servers. So in the case of this particular question, you're serving users in Australia.

You're using OpenAI. OpenAI only has inference servers in the U.S. You don't want to make extra round trips to the U.S. So there's a couple answers to that. One is, if you make one long haul to the U.S. for all the audio at the beginning of the chain and at the end of the chain, that is much better than making three inference round trips for transcription, OpenAI, and voice generation.
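A back-of-the-envelope version of that argument, with made-up but plausible round-trip times:

```python
# Back-of-the-envelope network cost, in milliseconds (made-up, plausible numbers).
syd_to_us_rtt = 200   # Sydney <-> US round trip
local_rtt = 10        # agent colocated with the inference servers

# Agent hosted in Sydney: every inference hop crosses the Pacific.
agent_in_sydney = 3 * syd_to_us_rtt                     # STT + LLM + TTS round trips = 600 ms
# Agent hosted next to the inference servers: one long audio hop, short inference hops.
agent_near_inference = syd_to_us_rtt + 3 * local_rtt    # 230 ms

print(agent_in_sydney, agent_near_inference)
```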

So that's one tool. We often say to people, just deploy close to the inference servers rather than close to the users, and optimize for having one long trip and then a bunch of very, very fast short trips. That's good but not great. The other option is to run stuff using open-weights models locally in Australia, which you can definitely do.

It's a longer conversation about which use cases you can serve with, say, the best open-weights models versus the GPT-4o and Gemini 2.0 Flash level models. But there are definitely some voice AI workloads now that you can reliably run on, like, the Gemma or Qwen3 or Llama 4 models.

Second question, maybe related to that: let's say we host models in Australia itself. What's the network interconnectivity from your cloud? Do you go through the internet exchange locally out there? Yes, so we have endpoints all over the world that are -- in our world, we call them points of presence.

So we have sort of the edge server close to the user and we'll terminate the WebRTC or the telephony connection there. And then we'll route over our own private AWS or OCI backbones to wherever you need to route to. If you're hosting in Australia, you should be able to just hit our endpoints and then you're hosting in Australia.

So we also -- we have some regional availability of PipeCat Cloud now. We will launch a bunch more regional availability of PipeCat Cloud over the next quarter. So I hope we actually have PipeCat Cloud in Australia soon. Although you can also obviously self-host in Australia and still use either PipeCat itself or PipeCat Plus Daily in other ways.

Thank you. Yeah. Oh, sorry. Thanks for the talk. Thank you. So there are others like Moshi. Yeah, yeah, yeah, I love Moshi. They basically claim that turn detection is no longer needed because they, in parallel, encode both the speaker and the language model. Do you have experience running those?

Do they actually scale? Yes. The question is about a really cool open-weights model called Moshi, by a French lab called Kyutai. Moshi is a sort of next-generation research model where the architecture is constant bidirectional streaming. So you're always streaming tokens in, and the model is always streaming tokens out.

In a conversational voice situation, which Moshi was designed for, most of the tokens streaming out are silence tokens of some kind. And when they're not silence tokens, it's because the model decided it was going to do whatever the model is trained to do. Which is really cool, because that can mean not just that the model does natural turn-taking, but also that the model can do things like backchanneling.

So the model can do the things humans do, the things its data set has audio for: while you're talking, I can say, hmm, ah, yeah, mm-hmm, uh-huh. And it's not actually a new inference call. It's just streaming. That paper, the Kyutai Moshi architecture paper, was my very favorite ML research paper from last year.

Now, that model itself is not usable in production for a bunch of reasons, including that it is too small a language model to be useful for basically any real-world use case. I have more to say about that, but I'm super, super excited about that architecture. But I don't think -- I mean, we're a couple years away from that architecture being actually usable and trained as a production model.

There are speech-to-speech models from the large labs that are closer to being able to be used in production. Now, they are not streaming architecture models, but they are native audio speech-to-speech models, which have a bunch of advantages, including really great multilingual support. So, like, mixed-language stuff is great from those models.

In theory, there are latency reductions, too. So, OpenAI has a real-time model called GPT-4o Audio Preview that sits behind their real-time API. It's a good model. Gemini 2.0 Flash is usable in an audio-to-audio mode, and they're training -- or they have preview releases of -- 2.5 Flash. These models are now good enough that you can use them for use cases where you are more concerned about the naturalness of the human conversation than you are about reliable instruction following and function calling.

They are less reliable in audio mode than the SOTA models operating in text mode. So, what we generally see is that for a small subset of voice AI use cases today that are really about, like, conversational dynamics, narrative, storytelling, those models are starting to get adopted. For the majority of sort of enterprise voice AI use cases where you really need best possible instruction following and function calling, those models are not yet the right choice.

But they are getting better every release, and all of us expect the world to move to speech-to-speech models being the default for, like, 95% of voice AI sometime in the next two years. The question is when, in your use case, will a particular model architecture sort of cross that threshold in your evals.

Sorry, what about Sesame? Did you put Sesame in that same bucket as Gemini and OpenAI, or -- Sesame is closer to Moshi. In fact, Sesame -- so there's another open weights -- or partly open weights and really interesting model called Sesame. It's a little like Moshi. It, in fact, uses the Moshi neural encoder.

Yeah, it uses Mimi. Sesame -- so Sesame has not yet been fully released. There isn't a full Sesame release. Also, I think Sesame is probably smaller than you would need for most enterprise use cases today. Although the lab training Sesame, I think, has bigger versions coming. There's also a speech-to-speech model called Ultravox, which is really good, which is trained on the Llama 3 70B backbone.

And that team supports that model and has a production voice AI API. That model is worth trying if you are really interested in speech-to-speech models. If Llama 3 70B can do what you want, I think Ultravox is a good choice. If Llama 3 70B isn't quite there for your use case, probably not.

But, you know, keep an eye on the next release of Ultravox. So speech-to-speech is definitely the future. I generally tell people to experiment with it. Don't necessarily start by assuming you're going to use it for your enterprise use case today, though. Given your vendor neutrality, can you speak to the strengths and weaknesses of using, like, the leading-edge multimodal input models like OpenAI and Gemini?

When should I choose OpenAI or when should I choose Gemini? So my opinion is that GPT-4o in text mode and Gemini 2.0 Flash in text mode are roughly equivalent models for the use cases that I test every day. So I would make the decision -- if you can, I would build a PipeCat pipeline and then just swap the two models and run your evals.
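A sketch of what that swap looks like, reusing the pipeline shape from earlier; the class names and module paths are assumptions based on recent Pipecat releases, and everything else in the pipeline and your evals stays the same:

```python
# Swap the LLM service, keep the rest of the pipeline and the evals identical.
# Class and module names are assumptions based on recent Pipecat releases.
import os

from pipecat.services.google import GoogleLLMService
from pipecat.services.openai import OpenAILLMService

# llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
llm = GoogleLLMService(api_key=os.environ["GOOGLE_API_KEY"], model="gemini-2.0-flash")

pipeline = Pipeline([transport.input(), stt, llm, tts, transport.output()])
```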

They're both really good models. One of the advantages of Gemini is that it's extremely aggressively priced. So, you know, a 30-minute conversation on Gemini is probably 10 times cheaper than a 30-minute conversation on GPT-4o. You know, that may or may not stay true as they both change their prices.

But that's definitely something we hear a lot from customers today is that they like the pricing of Gemini. The other interesting thing about Gemini is that it operates in native audio input mode very well. So you can use Gemini in native audio input mode and then text output mode in a pipeline.

And that has advantages for some use cases in some languages. And you can, again, test that on your evals. And OpenAI also has native audio support in some of their newer models. But I think they're just a little bit behind the Gemini models in that regard. Time for one more or are we done?

One more and then we're done. Yeah. What are the general advantages of speech-to-speech versus going speech-to-text, doing something, and then going back through text-to-speech? So the question is: what are the general advantages of speech-to-speech instead of speech-to-text, text inference, and text-to-speech back out? It's a super interesting question. And I have a practical answer and a philosophical answer.

I'll keep them both short. The practical answer is that you lose information when you transcribe. So if there's information in the audio that you would lose in the transcription step and that's useful for your use case, then a speech-to-speech model is great. So, for example, things like mixed language are very hard for small transcription models.

You're almost always losing a bunch more information in a mixed-language transcription than you are in, like, an optimized monolingual transcription. So why not go to the big LLM that just has all this, like, language knowledge and can do a better job on the multilingual input?

The other advantage is potentially you have lower latency. Like, if you've trained an end-to-end model for speech-to-speech and it's all one model and you're not, like, chaining together inference calls, you can probably get lower latency. In practice, whether that's true today depends more on the sort of APIs and inference stack than it does on the model architecture.

But I think we're all going towards assuming that we just want to do one inference call for, like, the bulk of things, and then we might use other little models on the side for, like, subsets. The philosophical answer, though, is that those advantages are probably outweighed by the challenges today's LLM architectures have when you have big context.

And audio tokens take up a lot of context tokens. So when you're operating in audio mode, you're just expanding the context massively relative to operating in text mode. And that tends to degrade the performance of the model. I think, a little bit relatedly, nobody has as much audio data as they have text data for training.

So even though a big model is doing a bunch of transfer learning when you give it a bunch of audio, and it is, in theory, sort of mapping all that audio to the same latent space as its text reasoning, in practice, it's definitely not doing that exactly. It's doing something like that, but not that.

And so because we don't have as much audio data, you see a lot of issues with audio to audio models, like the model will sometimes just respond in a totally different language. And that's cool, but it's never what you want in the enterprise, you know, voice AI use case.

And the best guess for why that's happening is that, from one projection, the audio lands in the right part of the latent space, but from some other projection it's in a totally different part of the latent space than it would be if you had given it text instead of audio. Even though, if you transcribed that audio, the text would be exactly the same.

So, you know, latent spaces are big, and to, like, actually find our way through them in post training, you really have to have a lot of data, and nobody has enough audio data yet. But the big labs are going to fix that, because audio matters and multi-turn conversations matter.
