We're here to talk about real-time conversational video with PipeCat, that's me, and with Tavis, that's Brian. We'll introduce ourselves a little bit more, but in the interest of keeping it moving, let's talk about what we're here for. Have any of you ever seen one of these robot concierge things?
Do they work? No, they don't. They're terrible, right? It's actually possible nowadays to build this kind of thing and have it actually be good. It's a little bit tricky, but that's what we're here to show you how to do. There are three things you need to think about when you want to build real-time AI.
The first is your models. Obviously, we all know what models are. That's why we're here at this conference. The thing that you don't necessarily know you need to think about is your orchestration layer. We're going to talk a little bit about that. Then, of course, you need to deploy these bots somewhere.
That's the third step, deployment. We'll talk about that as well. Step one, models. I come from a little more of a traditional, funny to say that, voice AI world, where the traditional pipeline people talk about is speech-to-text for transcription, then an LLM for inference, and then text-to-speech.
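To make that concrete in code, here's a purely illustrative sketch of the cascade with every provider call stubbed out; none of these function names come from a real SDK, they just show the shape of the loop.

```python
# Purely illustrative: the classic cascading voice pipeline as three
# chained stages. Every function below is a stub standing in for a real
# STT, LLM, or TTS provider call.

def speech_to_text(audio: bytes) -> str:
    return "hello there"                 # stand-in for transcription

def run_llm(user_text: str) -> str:
    return f"You said: {user_text}"      # stand-in for inference

def text_to_speech(reply: str) -> bytes:
    return reply.encode("utf-8")         # stand-in for synthesis

def handle_user_turn(audio: bytes) -> bytes:
    """One turn of the cascade: transcribe, think, speak."""
    return text_to_speech(run_llm(speech_to_text(audio)))

if __name__ == "__main__":
    print(handle_user_turn(b"\x00" * 320))
```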
That's the typical kind of cascading pipeline you hear. Sure enough, people nowadays are using some voice-to-voice models. That is a use case for this kind of thing, but there are reasons sometimes you might use one or the other. Real-time video is a lot more complicated. It doesn't have to be, but it can be, and I think Brian will tell you that it should be.
There's a lot more stuff you need to think about to do video generation in real time. So, Brian, you want to tell us a little bit about Tavis and how you all are thinking about this? Thanks, Chad. So, Tavis started out as an AI research company, and we started off with a single model, a rendering model.
What we quickly realized is that we need to be able to put this into a real-time context for it to be useful, so it needs to be fast. And once we did that, we started realizing there are a lot of missing pieces, things like turn detection, response timing, picking up signals and orchestration.
In the beginning we didn't know about PipeCat when we first built it, but we've been partnering with PipeCat over the last year, and we've come to realize that a lot of the stuff PipeCat does is going to be very important for conversational AI and for making it real.
I think we can go to the next one. Yep. We have a demo. I was going to do it live, but for the sake of time, just go check it out on our website, Tavis.io. And I'll hand it back.
Well, no, there's one more thing. Yep. So what we do at Tavis now is we offer a conversational video interface. It is an end-to-end pipeline that allows you to have a conversation with a replica of anyone. You can create your own replica of yourself, you can put it online, and you can have a conversation.
The response time is around 600 milliseconds, but that's not ideal, because a lot of times that's too fast. So we have to slow that down sometimes based on some of these models that we're using. And there are a lot of steps that go into this. You can see there are, like Chad talked about, the basic layers of a conversational stack.
But we also have these proprietary models, Sparrow Zero and Raven Zero, that we've created, which are kind of our IP, what we're offering. Right now we offer those in our stack, but we're moving toward a world where we're going to offer those in things like PipeCat.
So, models. We'll come back to the Tavis models in a little bit, and to how they're getting better and some of the cool new things coming from Tavis that you'll want to use. Orchestration is where my world steps in. That's PipeCat. That's the thing on my water bottle and my shirt and my jacket and all that kind of stuff.
Let's talk a little bit about what PipeCat is. There's a really interesting phrase on Brian's slide: real-time observability and control into the flow of a conversation. Those are words that don't really mean anything until you actually go build one of these things.
And when you build it, the first time you use it you go, wow, this is amazing, this is great. Then, as you start to actually think about what it's going to mean to have that in production, you realize, oh wait, there are a lot of boring infrastructure kinds of things we need to solve: the ability to have observability into how the bot is behaving and why it's behaving that way, the ability to capture metrics on things and understand things like, sometimes the bot takes a long time to respond, I wonder why that is.
Well, it turns out there are a whole lot of these kinds of things that you need for a real, live production bot, and that's where you need something like PipeCat. PipeCat is an open source framework. It's built by my company, but it is open source and actually fully vendor-neutral.
And it's designed to be this orchestration layer for real-time AI. By that I mean you have a user that is going to be producing video and/or audio, and you want to also be delivering video and/or audio to that user, and you want to do that with as low latency as possible.
That's the real-time part of this whole conversation. If you went to the AI Engineer website and saw the little button on the bottom right that says talk to AIE, that's powered by PipeCat. It's actually using the Gemini Live model, so it's using a voice-to-voice model.
But there's still so much other stuff you have to do to go from a voice-to-voice demo bot in your browser to an actual shipping production app. Even Google themselves, even the Gemini documentation, says you can go use our browser tools and things to experiment with Gemini Multimodal Live.
But when you want to take it to production, you do need something like PipeCat to actually orchestrate what's happening in your entire app. I'm going to try to do this slide very quickly. And there are a few QR codes coming up, so now would be a good time to get those buttons ready.
PipeCat itself: there are two lists of three that you need to think about to understand what PipeCat does. The first one is something I just kind of already talked about. The three things PipeCat is doing for you are handling input, handling processing, and handling output. Input is receiving media from your user.
So in the case of a traditional voice bot, that's just voice. In the case of a Tavis replica, that's voice, and they're even doing some interesting things that we'll talk about with inputting your user's video and allowing a Tavis replica to respond not only to what it's hearing in the voice, but to what it's seeing in the video coming from the user.
Getting into that, that's the processing part. That's step two. That's where essentially you're going to run through a bunch of different models. In some cases, you can do almost all of what you need with a single model, like Gemini Multimodal Live for voice or a Tavis replica; there is a way you use Tavis inside a PipeCat bot where you can basically let Tavis do everything for you.
It runs as just one integrated piece. And then, of course, this is supposed to be real-time and interactive video, so hopefully all of those models are producing some kind of output that you want to show to your user. That's the video and the audio being produced by your tools.
In a typical voice bot, that is text-to-speech being played out as audio. It might also be things like UI updates: if you're in a web app, you might be pushing UI updates, that kind of thing. And of course, in the Tavis case, it's video and audio that are hopefully presented in a way where the video stays synchronized with the audio, for example.
That's a really, really hard thing to do well, depending on exactly how you build this whole thing. The three fundamental pieces of PipeCat that enable those things to work are frames, processors, and pipelines. PipeCat's name comes from the fact that it is about building a pipeline, and a pipeline is composed of processors.
Processors are things that handle frames. A frame is basically a typed container for a kind of data. So, in a PipeCat pipeline, you will see a whole bunch of frames carrying things like little snippets of user audio, say 10 or 20 milliseconds of audio coming across as an audio frame, or video frames you can capture from the user's camera device.
But even things like voice activity detection, VAD, come across as a user-started-speaking frame in PipeCat. All of those frames progress through a series of processors, and a processor just takes in some frames and outputs other frames. A good example would be the LLM processor, which takes in frames that are essentially context frames, like completed context turns from the user and the bot, and outputs a stream of text frames.
So, if you're capturing streaming output from your LLM, in PipeCat that looks like a bunch of text frames coming out of that processor. And all those are put together in a pipeline and the pipeline is how you describe what you want your bot to do. And the idea behind how PipeCat runs your pipeline is that it's doing all of that stuff asynchronously and doing its best to minimize the latency of every piece of information as it goes through the pipeline.
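To make frames and processors a bit more concrete, here's a minimal sketch of a custom processor that just watches the LLM's streamed text go by. The base class and frame names follow a recent PipeCat release; exact import paths move around between versions, so check the docs.

```python
# A minimal custom PipeCat processor: accumulate the LLM's streamed text
# between the start/end frames and log the full response, while passing
# every frame through untouched. Import paths follow a recent PipeCat
# release and may differ in yours.

from pipecat.frames.frames import (
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    TextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class ResponseLogger(FrameProcessor):
    def __init__(self):
        super().__init__()
        self._chunks: list[str] = []

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, LLMFullResponseStartFrame):
            self._chunks = []                  # a new bot turn is starting
        elif isinstance(frame, TextFrame):
            self._chunks.append(frame.text)    # one streamed text chunk
        elif isinstance(frame, LLMFullResponseEndFrame):
            print("Bot said:", "".join(self._chunks))

        # Always push the frame along so downstream processors still see it.
        await self.push_frame(frame, direction)
```

You'd drop a processor like that into the pipeline right after the LLM, and it sees every frame flowing downstream.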
I know that was a lot. There's a much better and longer explanation in the PipeCat docs; that's what that QR code is for. In terms of what it actually looks like, it was going to be a little tight to try to get in and do some live coding during 15 minutes.
But this is a QR code that links to this example file. There's so much stuff in the PipeCat repo that shows you this. But just to step through these pieces real quick, at the top there's the transport input. This is the core pipeline inside this bot file. And this is actually one of the Tavis examples that we have in the repo.
First thing is transport input. That's where the frames come in from your media transport. So, whether it's WebRTC or WebSockets or Twilio WebSockets or anything like that, frames start pouring in from the transport input. They go to a speech-to-text processor. That's where transcription is happening. So, for example, one thing that frame processor is doing is it's collecting snippets of audio at, you know, a frame at a time, 20 milliseconds at a time.
But it is sort of up to your transcription processor, whatever that is, Deepgram or Whisper running on something or whatever, to collect however many frames it needs to then output a snippet of transcription information, right? So that happens in speech-to-text. From there we go into something called the context aggregator.
That's because the transcription or STT processor is emitting transcriptions whenever it feels like it. So we use other frames in the pipeline that have made their way through to understand, okay, the user has started talking; the user's microphone level has dropped, so it looks like the user has stopped talking.
Maybe now is a good time to group all of the various pieces of transcription we've gotten over the past few seconds together and emit a single context aggregation frame. That's what triggers the LLM to run. So we grab the context, and of course, if you've programmed with LLMs, you know the context is the array of messages and the tools and everything.
You show that to the LLM and then it starts streaming tokens back. Those tokens come out of the LLM as text frames, along with a start and an end frame. And if you're familiar with this approach, you can probably see all these other frames as they start to exist in here.
Then TTS essentially accumulates those and generates speech. This bot file is actually an older example that uses an older Tavis model, where we were actually generating audio and sending it over, I believe, a WebSocket, it's not important. We were sending audio to a Tavis model that was generating the video based on the audio and then sending audio and video back to us, back to PipeCat.
So essentially the same audio, but synchronized with the video. Those come back as a different series of frames that then go out the transport output. And that is, again, essentially the same transport that we're using on the input side, but this is the output side. So that's where all that media goes back to the user.
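To give a sense of what that bot file boils down to, here's a compressed, hedged sketch of the same kind of cascaded pipeline. The specific services here (Daily for the transport, Deepgram, OpenAI, Cartesia) and the import paths are assumptions based on a recent PipeCat release rather than the exact repo example; the file behind the QR code is the source of truth, and the Tavis video step is only indicated as a comment.

```python
# Hedged sketch of a cascaded PipeCat bot. Service choices and import
# paths are assumptions from a recent release; treat the repo example as
# the source of truth.

import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    transport = DailyTransport(
        os.environ["DAILY_ROOM_URL"],
        None,
        "Replica Bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id=os.environ["CARTESIA_VOICE_ID"],
    )

    context = OpenAILLMContext([{"role": "system", "content": "Be brief."}])
    aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),       # media in from the user
        stt,                     # audio frames -> transcription frames
        aggregator.user(),       # group transcripts into a user turn
        llm,                     # context -> streamed text frames
        tts,                     # text frames -> audio frames
        # In the Tavis video example, the Tavis service would slot in here,
        # turning the bot audio into synchronized audio + video.
        transport.output(),      # media back out to the user
        aggregator.assistant(),  # record the bot's turn in the context
    ])

    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```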
So you can start to see how, with this structure, it looks very simple right here, but it is incredibly powerful when you realize you can put pretty much anything you want in this pipeline. For example, there's a construct in PipeCat called parallel pipelines.
And so we have people that have this exact same workflow, but at the same time, in real time, they're running another LLM that is doing things like sentiment analysis. There's one PipeCat user I talked to that is using Gemini Multimodal Live to detect whether the person answering the phone is a person or a voicemail greeting, and they have separate pipelines running for whether it's a voicemail or whether it's a human.
And all that happens in PipeCat through the use of a parallel pipeline: run one model to determine which it is, and then it sends a signal back to the pipeline to say, do the voicemail branch or do the human branch. So you can start to get an idea of what you can build, even if you have a model like Tavis that is doing 90% of the hard work of making the actual interaction feel good.
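Here's a hedged sketch of that kind of parallel layout, assuming PipeCat's ParallelPipeline construct. The TranscriptWatcher processor is a made-up illustration, not a PipeCat class; it stands in for whatever sentiment analysis or voicemail detection you'd run on the side branch.

```python
# Hedged sketch of a parallel-pipeline layout. ParallelPipeline is a
# PipeCat construct; TranscriptWatcher below is a hypothetical example
# processor, not part of PipeCat.

from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptWatcher(FrameProcessor):
    """Hypothetical side-branch processor: peek at transcripts, pass frames on."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            # e.g. run sentiment analysis or voicemail detection here
            print("heard:", frame.text)
        await self.push_frame(frame, direction)


def build_pipeline(transport, stt, aggregator, llm, tts):
    # Both branches see the same upstream frames; only the main branch
    # produces the bot's spoken response.
    return Pipeline([
        transport.input(),
        stt,
        ParallelPipeline(
            [aggregator.user(), llm, tts],  # main conversation branch
            [TranscriptWatcher()],          # side analysis branch
        ),
        transport.output(),
        aggregator.assistant(),
    ])
```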
There's just enough other stuff that's going to happen around the periphery that it makes a lot of sense to wrap what you're doing inside something like PipeCat. This is what, so Brian showed a picture, this is that same Tavis avatar. If you go to the QR code on the last slide, which is going to come up again in a second, you can basically run that example: you just need to sign up for Tavis, you get a key, you drop the key in there, you run that example code unmodified, and it will pop up this UI where you can talk to that avatar, talk to the replica, in real time, but you can also see some of the interesting guts of what's happening inside PipeCat in that debug panel over there.
Do you want to tell us a little bit about why this architecture is interesting and what we can do in the near future with it? Yeah. So as I mentioned, when we first built Tavis's conversational video interface, we built it ourselves because we didn't know about PipeCat. So we've spent the last year learning a lot of the lessons that PipeCat has already solved.
There are a ton of orchestration, aggregation, and communication functionalities that are in PipeCat already that are going to basically save you months of time. I mean, it's going to save you a lot of time. So when we first talked about having this talk, I was like, we're not using PipeCat internally.
I was like, I can't really say we're using it internally. But the thing is, our customers that have come to us, the enterprise customers, they're using PipeCat and they want to be able to use our stuff in PipeCat. So now we're getting ready to move our best models into PipeCat.
We've already moved Phoenix, which is our rendering model, but we're also going to be moving turn-taking, response timing, and perception models, things like that. And eventually we're going to end up actually bringing PipeCat internally as well, because I spent the last couple of days debugging a problem that PipeCat has already solved really well.
And I don't want to have to do that anymore. Yeah. So I talked about these models that are coming. We have a couple of different, unique models. Our turn detection model is a multilingual model that determines when a person is done speaking. You wouldn't believe how important that is in conversational AI.
It's going to make your AI faster, and it's going to make it so it doesn't interrupt people. If you have a very fast conversational pipeline, oftentimes it will actually talk over the user. And if you have a slow one, it will take so long to respond that people will be like, is it broken?
You want the best of both worlds, and that's what turn detection does. We're also working on a response timing model right now (we're bringing all of these to PipeCat soon), and that response timing model will determine how quickly the bot should respond even once the person is done.
Because if I'm telling you about my grandmother who's going into a home and she's sad, you're not going to want to respond to that quickly. You want to think and take your time, right? But if we're having a chit-chat, you want to be fast.
So that's what that's all about. And then finally, our multimodal perception model is able to look at emotions, look at the surroundings, what the person's wearing. We'll also be feeding that into the turn-taking and the response timing so that we're able to provide much more nuanced conversational flow.
So those things are coming. To my point, I will tear through the last of these because we are already out of time and that's my fault. This is another example showing essentially a different way that you can integrate Tavis into PipeCat.
And this is part of the flexibility. As they develop new models, there are going to be things that will run directly inside Tavis, and there are things you'll want a little bit of control over. For those, you just drop them into a slightly differently shaped pipeline and you can get your bot to actually do what you want it to do.
I will talk about step three, which is deployment, extremely quickly. There are a lot of different ways that you can ship these bots. PipeCat is, I sometimes call it, open source to a fault; I wish it had a few more opinions on some things.
Really, what you need is two pieces. You need some kind of basic REST API to allow your app, whatever your client app is, to say that a user wants to talk to a bot. And when that happens, you need something to relatively quickly spin up a new instance of your bot and connect it to that user.
That's essentially what's being shown here. And then you also need a thing we haven't talked about, again, go read the docs, which is the transport layer. That's the, hopefully, WebRTC part that actually moves the media back and forth. That's part of what your infrastructure is configuring.
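As a sketch of that "start a bot" piece, here's a minimal illustrative endpoint. FastAPI, the /start route, the bot.py entry point, and launching one subprocess per session are all assumptions for illustration; PipeCat doesn't prescribe any of this, and PipeCat Cloud or your own infrastructure would replace the naive subprocess call.

```python
# Minimal sketch of the "start a bot" REST endpoint described above.
# FastAPI and one subprocess per bot are illustrative choices; the route
# name, the bot.py entry point, and the room-creation step are assumptions,
# not part of PipeCat itself.

import subprocess
import uuid

from fastapi import FastAPI

app = FastAPI()


@app.post("/start")
def start_bot():
    # In a real deployment you would create a WebRTC room (e.g. via your
    # transport provider's REST API) and pass its URL/token to the bot.
    room_url = f"https://example.daily.co/{uuid.uuid4().hex[:8]}"

    # Spin up one bot process per user session. PipeCat Cloud or your own
    # Kubernetes/container setup would replace this naive subprocess call.
    subprocess.Popen(["python", "bot.py", "--room-url", room_url])

    # The client uses this to join the same room as the bot.
    return {"room_url": room_url}
```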
You have a user that wants to use a bot; you need an API that can start a bot and connect that bot to your user. The very short version: if you want to just solve this problem with a little bit of money, come talk to us at our booth, because PipeCat Cloud is for you if you don't want to mess with Kubernetes and all that kind of stuff. If Kubernetes makes you... we used to have this thing at Heroku where it would replace Kubernetes with scare quotes around it in the Heroku Slack, which was fun.
Come to this talk. This is Mark, one of my colleagues, talking a lot more about PipeCat Cloud and how we solve the problems of deploying bots at scale, and how you can either use PipeCat Cloud or, if you want to actually just do it yourself, learn how to do that.
And that's our time. Thank you all very much.