
Realtime Conversational Video with Pipecat and Tavus — Chad Bailey and Brian Johnson, Daily & Tavus



00:00:00.040 | We're here to talk about real-time conversational video with PipeCat, that's me, and with Tavus,
00:00:19.320 | that's Brian. We'll introduce ourselves a little bit more, but in the interest of keeping it moving,
00:00:22.600 | let's talk about what we're here for. Have any of you ever seen one of these robot concierge things?
00:00:28.080 | Do they work? No, they don't. They're terrible, right? It's actually possible nowadays to build this kind of thing,
00:00:34.380 | but actually good. It's a little bit tricky, but that's what we're here to show you how to do.
00:00:39.360 | There are three things you need to think about when you want to build real-time AI. The first is your
00:00:46.620 | models. Obviously, we all know what models are. That's why we're here at this conference. The thing
00:00:51.680 | that you don't necessarily know you need to think about is your orchestration layer. We're going to
00:00:56.220 | talk a little bit about that. Then, of course, you need to deploy these bots somewhere. That's the
00:01:00.420 | third step, deployment. We'll talk about that as well. Step one, models. I come from a little bit
00:01:08.040 | more of a traditional, funny to say that, voice AI world where the traditional pipeline people talk
00:01:13.560 | about is speech-to-text, so transcription, and then LLMs for your inference, and then text-to-speech.
00:01:19.620 | That's the typical kind of cascading pipeline you hear. Sure enough, people nowadays are using some
00:01:24.660 | voice-to-voice models. That is a use case for this kind of thing, but there are reasons sometimes you
00:01:28.500 | might use one or the other. Real-time video is a lot more complicated. It doesn't have to be,
00:01:35.760 | but it can be, and I think Brian will tell you that it should be. There's a lot more stuff you need to
00:01:40.440 | think about to do video generation in real-time. So, Brian, you want to tell us a little bit
00:01:46.620 | about Tavus and how you all are thinking about this? Thanks, Chad. So, Tavus started out as an AI
00:01:51.960 | research company, and we started off with a single model that was like a rendering model. What we
00:01:57.920 | quickly realized is that we need to be able to put this into a real-time context for it to be useful,
00:02:03.680 | so it needs to be fast. And once we did that, we started realizing there are a lot of missing pieces,
00:02:09.060 | things like turn detection, response timing, picking up signals, and orchestration. And when we started out,
00:02:16.980 | we didn't know about PipeCat when we first built this, but we've been partnering with
00:02:21.420 | PipeCat over the last year, and we've come to realize that a lot of the stuff PipeCat
00:02:26.160 | does is going to be very important for conversational AI and making it real. I think we can go to the next
00:02:33.660 | one. Yep. We have a demo. You can go to our site, Tavus.io. I was going to do it live,
00:02:38.300 | but for the sake of time, just check it out. You can go check it out on our website, Tavus.io.
00:02:43.220 | And I'll hand it back. Well, no, there's one more thing. Yep. So what we do at Tavus now is we
00:02:50.220 | offer a conversational video interface. It is an end-to-end pipeline that allows you to have a
00:02:56.060 | conversation with a replica of anyone. You can create your own replica of yourself, you can put
00:03:01.180 | it online, and you can have a conversation. The response time is around 600 milliseconds, but that's not
00:03:06.920 | ideal because a lot of times that's too fast. So we have to slow that down sometimes based on
00:03:11.840 | some of these models that we're using. And there are a lot of steps that go into this. You can see
00:03:18.260 | there are, like Chad talked about, the basic layers of a conversational stack. But we also have
00:03:25.360 | these proprietary models, Sparrow Zero and Raven Zero, that we've created, which is kind of like our
00:03:30.140 | IP or what we're offering. Right now, we offer those in our stack. But we're
00:03:35.980 | moving towards a world where we're going to offer those in things like PipeCat.
00:03:38.900 | So models, and we'll come back to the Tavus models in a little bit and how they are getting better and
00:03:46.100 | some of the cool new things that are coming from Tavus that you will want to use. Orchestration is where
00:03:51.680 | my world steps in. So that's PipeCat. That's the thing on my water bottle and my shirt and my jacket and
00:03:57.980 | all that kind of stuff. Let's talk a little bit about what PipeCat is. There's a really
00:04:04.060 | interesting phrase on Brian's slide, real time observability and control into the flow of a
00:04:09.060 | conversation. A lot of those are words that don't really mean anything until you
00:04:16.540 | actually go build one of these things. And when you build it, the first time you use it, you go, wow,
00:04:21.000 | this is amazing. This is great. And then as you start to actually think about what it's going to mean
00:04:24.640 | to have that in production, you realize, oh, wait, there are a lot of boring infrastructure
00:04:30.720 | kinds of things that we need to solve: the ability to have observability into
00:04:35.720 | how the bot is behaving and why it's behaving that way, the ability to capture metrics
00:04:39.220 | on things and understand things like, sometimes the bot takes a long time to respond, I wonder
00:04:43.300 | why that is. Well, it turns out there are a whole lot of these kinds of things that you need for
00:04:47.920 | a real live production bot, and that's where you need something like
00:04:52.440 | PipeCat. PipeCat is an open source framework. It's built by my company, but it is open source
00:04:57.120 | and actually fully vendor neutral. And it's designed to be this orchestration layer for real time AI.
00:05:03.320 | And by that, I mean, you have a user that is going to be producing video
00:05:08.400 | and/or audio. And you also want to be delivering video and/or audio to that user. And you want to do that
00:05:14.120 | with as low latency as possible. That's the real-time part of this whole conversation.
00:05:17.720 | If you went to the AI Engineer website and you saw the little button on the bottom right that says
00:05:23.040 | talk to AIE, that's powered by PipeCat. It's actually using the Gemini Live model. So it's using a
00:05:29.100 | voice-to-voice model. But there's still so much other stuff you have to do to go from a voice-to-voice demo bot
00:05:35.360 | in your browser or on the web to an actual shipping production app that even
00:05:42.080 | Google themselves, even the Gemini documentation, says: you can go use our tools, our
00:05:50.480 | browser tools and things, to experiment with Gemini Multimodal Live. But when you want to
00:05:56.080 | take it to production, you do need something like PipeCat to actually orchestrate what's happening in your
00:06:00.800 | entire app. I'm going to try to do this slide very quickly. And there are a few QR codes coming up. So
00:06:09.120 | now would be a good time to get those buttons ready. PipeCat itself, two lists of three that you need to
00:06:18.400 | think about to understand what PipeCat does. The first one is something I just kind of already talked
00:06:22.240 | about. The three things that PipeCat is doing for you are handling the input, the processing,
00:06:28.560 | and the output. Input is receiving media from your user. So in the case of a traditional voice bot,
00:06:35.200 | that's just voice. In the case of a Tavus replica, that's sending voice, and they're even doing
00:06:40.960 | some interesting things that we'll talk about with inputting your user's video and allowing a Tavus replica
00:06:46.800 | to respond to not only what it's hearing in the voice, but what it's seeing in the video coming
00:06:51.680 | from the user. Getting into that, that's the processing part. That's step two. That's where
00:06:57.600 | essentially you're going to run through a bunch of different models. In some cases, you can do almost
00:07:02.480 | all of what you need with a single model. In the case of, like, Gemini Multimodal Live for voice or a
00:07:07.120 | Tavus replica, there is a way that you use Tavus inside a PipeCat bot where you can basically let
00:07:12.960 | Tavus kind of do everything for you. Run kind of as just one integrated piece. And then, of course,
00:07:19.040 | all of those models, hopefully, this is supposed to be real time and interactive video. Hopefully,
00:07:23.040 | those models are producing some kind of output that you want to show to your user. That's the video
00:07:27.360 | and the audio being produced by your tools. In a typical voice bot, that is, you know, that is
00:07:32.960 | text to speech that is being played out as audio. It might also be things like UI updates. If you're
00:07:38.240 | in a web app that you're pushing UI updates, that kind of thing. And of course, in the Tavus case,
00:07:41.840 | it's video and audio that are hopefully presented in a way where the video stays synchronized to the
00:07:46.640 | audio, for example. That's a really, really hard thing to do well, depending on exactly how you build
00:07:50.800 | this whole thing. The three fundamental pieces of the of PipeCat that enable those things to work
00:07:57.440 | are frames, processors, and pipelines. PipeCat's name comes from the fact that it is about building
00:08:02.800 | a pipeline, and a pipeline is composed of processors. Processors are things that handle
00:08:10.640 | frames. A frame is basically a typed container for a kind of data. So, in a PipeCat
00:08:18.640 | pipeline, you will see a whole bunch of frames with things like little snippets of user audio,
00:08:23.520 | like 10 or 20 milliseconds of audio comes across as an audio frame or video frames from the user's camera
00:08:29.760 | device you can capture. But even things like voice activity detection, VAD, comes across as a user
00:08:34.960 | started speaking frame in PipeCat. All of those frames progress through a series of processors and a
00:08:40.160 | processor just takes in some frames and outputs other frames. So, a good example would be like the LLM
00:08:47.680 | processor, for example, is taking in frames that are essentially context frames, like completed context
00:08:54.560 | turns from the user and the bot, and it is outputting a stream of text frames. So, if you're capturing
00:09:00.560 | streaming output from your LLM, in PipeCat that looks like a bunch of text frames coming out of that
00:09:06.400 | processor. And all those are put together in a pipeline and the pipeline is how you describe what
00:09:11.760 | you want your bot to do. And the idea behind how PipeCat runs your pipeline is that it's doing all of
00:09:16.720 | that stuff asynchronously and doing its best to minimize the latency of every piece of information
00:09:22.720 | as it goes through the pipeline. So, there is a much better and longer explanation. I know that
00:09:28.640 | that was a lot. There's a much better and longer explanation in the PipeCat docs, at that QR code.
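As a rough illustration of that frame-and-processor idea, here is a minimal sketch of a custom processor. It assumes the FrameProcessor, Frame, and TextFrame classes found in recent PipeCat releases; exact import paths and signatures can vary between versions, so treat it as a sketch rather than code tied to any specific release.

```python
# Minimal sketch of a custom PipeCat frame processor (illustrative, not from the talk).
# Assumes the FrameProcessor / frame classes in recent pipecat-ai releases; exact
# import paths and method signatures may differ between versions.
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class ShoutProcessor(FrameProcessor):
    """Upper-cases every TextFrame and passes all other frames through untouched."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TextFrame):
            # Emit a new frame downstream rather than mutating the original.
            await self.push_frame(TextFrame(text=frame.text.upper()), direction)
        else:
            # Unhandled frames must keep flowing, or the rest of the pipeline stalls.
            await self.push_frame(frame, direction)
```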
00:09:34.560 | In terms of what it actually looks like, it was going to be a little tight to try to get in and do some live
00:09:39.760 | coding during 15 minutes. But this is a QR code that links to this example file. There's so much stuff
00:09:46.800 | in the PipeCat repo that shows you this. But just to step through these pieces real quick, at the top
00:09:51.760 | there's the transport input. This is the core pipeline inside this bot file. And this is actually one of
00:09:57.360 | the Tavus examples that we have in the repo. First thing is transport input. That's where the frames come
00:10:03.040 | in from your media transport. So, whether it's WebRTC or WebSockets or Twilio WebSockets or anything
00:10:09.040 | like that, frames start pouring in from the transport input. They go to a speech-to-text processor. That's
00:10:14.560 | where transcription is happening. So, for example, one thing that frame processor is doing is it's
00:10:19.760 | collecting snippets of audio at, you know, a frame at a time, 20 milliseconds at a time. But it is sort of
00:10:25.200 | up to your transcription processor, whatever that is, Deepgram or Whisper running on something or whatever,
00:10:30.560 | to collect a bunch of frames, however many frames it needs, to then output a
00:10:35.680 | snippet of transcription information, right? So that happens in speech-to-text. From there we go into
00:10:40.160 | something called the context aggregator. That's because the transcription or the STT processor is
00:10:46.000 | emitting transcriptions whenever it feels like it. So we use other frames in the pipeline that have made
00:10:51.520 | their way through to understand, okay, the user has started talking. The user's, you know,
00:10:56.480 | microphone level has dropped. So it looks like the user has stopped talking.
00:10:59.760 | Maybe now is a good time to group all of the various pieces of transcription we've gotten over
00:11:03.840 | the past few seconds together and emit a single context aggregation frame. That's what triggers
00:11:09.360 | the LLM to run. And so we grab the context. And of course, if you've programmed with
00:11:14.640 | LLMs, you know you get the context with the array of messages and the tools and everything.
00:11:18.480 | You show that to the LLM and then it starts streaming tokens back. Those tokens come out of the LLM
00:11:23.280 | as text frames, along with a start and an end frame. And if you're familiar with this
00:11:27.840 | approach, you can probably see all these other frames as they start to exist in here. But then TTS
00:11:32.960 | essentially accumulates those and generates speech. This bot file is actually an older example that uses an
00:11:41.040 | older Tavus model where we were actually generating audio. And then we were sending the audio over, I believe,
00:11:47.520 | a WebSocket, it's not important. We were sending audio to a Tavus model that was generating the video
00:11:54.000 | based on the audio and then sending back to us, back to PipeCat, audio and video. So essentially the same
00:11:59.840 | audio, but synchronized with the video. Those come as a different series of frames that then go out the
00:12:04.240 | transport output. And that is, again, essentially the same transport that we're using on the
00:12:09.600 | input side, but this is the output side. And so that's where all that media goes back to the other user.
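To make that walkthrough a bit more concrete, here is a rough sketch of how such a cascading pipeline is assembled, assuming the Pipeline, PipelineTask, and PipelineRunner classes from recent PipeCat releases. The processors are passed in as opaque parameters here because the concrete service classes (STT, LLM, TTS, Tavus) and their configuration belong to the actual example file linked from the QR code.

```python
# Illustrative assembly of a cascading PipeCat pipeline (STT -> LLM -> TTS -> Tavus video).
# The processors are passed in as parameters; see the real example in the PipeCat repo
# for the concrete service classes and their configuration.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


async def run_bot(transport, stt, context_aggregator, llm, tts, tavus):
    # Frames flow top to bottom: media in, transcription, context aggregation,
    # LLM inference, speech synthesis, Tavus video generation, media out.
    pipeline = Pipeline([
        transport.input(),               # audio/video frames from the user
        stt,                             # speech-to-text, e.g. Deepgram or Whisper
        context_aggregator.user(),       # groups transcripts into a completed user turn
        llm,                             # streams tokens back out as text frames
        tts,                             # accumulates text frames, emits audio frames
        tavus,                           # lip-synced replica video for that audio
        transport.output(),              # synchronized audio/video back to the user
        context_aggregator.assistant(),  # records the bot's reply in the context
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```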
00:12:14.320 | So you can start to see how, with this structure, it looks very simple right here,
00:12:20.640 | but it is incredibly powerful when you realize that you can kind of put anything you want in this
00:12:24.480 | pipeline. We have people, for example, using a construct in PipeCat called parallel
00:12:29.280 | pipelines. So we have people that have this exact same workflow, but at the same time, in real time,
00:12:34.240 | they're running another LLM that is doing things like sentiment analysis. Or there's
00:12:40.880 | one PipeCat user I talked to that is using Gemini Multimodal Live to
00:12:46.960 | detect if the person answering the phone is a person or if it's a voicemail greeting,
00:12:52.960 | and they have separate pipelines running for whether it's a voicemail
00:12:56.880 | or whether it's a human. And all that happens in PipeCat through the use of a parallel pipeline:
00:13:00.480 | run one model to determine, and then it sends a signal back to the pipeline to say,
00:13:04.800 | do the voicemail branch or do the human branch. So you can start to get an idea of what you
00:13:10.880 | can build, even if you have a model like Tavus that is doing 90% of the hard work of making the actual
00:13:17.440 | interaction feel good. There's just enough other stuff that's going to happen around the periphery
00:13:22.560 | that it just makes a lot of sense to wrap what you're doing inside something like PipeCat.
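As a sketch of that branching idea: PipeCat does ship a ParallelPipeline construct, but the classifier and the two branches below are hypothetical stand-ins for whatever processors you would actually run, not that user's implementation.

```python
# Sketch of the voicemail-vs-human routing idea using ParallelPipeline (illustrative).
# ParallelPipeline is a real PipeCat construct; the classifier and branch processors
# here are hypothetical placeholders, not the actual user's code.
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline


def build_pipeline(transport, stt, call_classifier, voicemail_branch, human_branch, tts):
    # Each list is a branch that sees the same frames at the same time. The classifier
    # branch decides who answered and signals the other branches (for example with a
    # custom frame) so only the right one actually produces a response.
    return Pipeline([
        transport.input(),
        stt,
        ParallelPipeline(
            [call_classifier],    # e.g. Gemini Multimodal Live deciding human vs. voicemail
            [voicemail_branch],   # leave-a-message flow, gated off until signaled
            [human_branch],       # normal conversation flow
        ),
        tts,
        transport.output(),
    ])
```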
00:13:28.160 | So Brian showed a picture; this is that same Tavus avatar. If you go to the QR code
00:13:35.280 | on the last slide, which is going to come up again in a second, you can basically run that
00:13:41.600 | example. You just need to sign up for Tavus, you get a key, you drop the key in there, you run that
00:13:45.440 | example code unmodified, and it will pop up this UI where you can talk to that avatar in real time,
00:13:51.680 | talk to the replica in real time, but also you can see some of the interesting guts of what's
00:13:56.080 | happening inside PipeCat in that debug panel over there. Do you want to tell us a little bit about
00:14:00.960 | why this architecture is interesting and what we can do in the near future with it?
00:14:05.920 | Yeah. So as I mentioned, when we first built Tavus's conversational video interface,
00:14:10.720 | we built it ourselves because we didn't know about PipeCat. So we've spent the last year learning a lot of
00:14:15.680 | the lessons that PipeCat has already solved. There are a ton of orchestration, aggregation, and
00:14:24.640 | communication functionalities that are in PipeCat already that are going to basically save you
00:14:30.720 | months of time. I mean, it's going to save you a lot of time. So when we first talked about having
00:14:36.960 | this talk, I was like, we're not using PipeCat internally; I can't really say we're
00:14:42.320 | using it internally. But the thing is, our customers that have come to us, the enterprise customers,
00:14:46.960 | they're using PipeCat and they want to be able to use our stuff in PipeCat. So now we're getting
00:14:52.880 | ready to move our best models into PipeCat. We've already moved Phoenix, which is our rendering model,
00:14:58.240 | but we're also going to be moving turn-taking, response timing, and perception models, things like that.
00:15:04.480 | And eventually we're going to end up actually bringing PipeCat internally as well, because I spent
00:15:10.720 | the last couple of days debugging a problem that PipeCat's already solved really well,
00:15:14.960 | and I don't want to have to do that anymore.
00:15:16.240 | Yeah. So I talked about these models that are coming. We have a couple of different,
00:15:24.720 | unique models. Our turn detection model is a multilingual model that determines when a person
00:15:31.360 | is done speaking. You wouldn't believe how important that is in a conversational AI.
00:15:36.240 | It's going to make your AI faster and it's going to make it so it doesn't interrupt people
00:15:42.960 | while they're still speaking. If you have a very fast conversational pipeline, oftentimes it will
00:15:50.800 | actually talk over the user. And if you have a slow one, it will take so long to respond that
00:15:56.400 | people will be like, is it broken? You want to get the best of both worlds. And that's what turn
00:15:59.920 | detection does. We're also working on a response timing model right now. And that response timing,
00:16:05.520 | we're bringing all these to PipeCat soon. That response timing model will determine how quickly it
00:16:10.240 | should respond once the person's done. Because if I'm telling you about my grandmother
00:16:15.600 | who's going into a home and she's sad, you're not going to want to quickly
00:16:20.640 | respond to that. You want to think and take your time, right? But if we're having a chit-chat, you
00:16:24.400 | want to be fast. So that's what that's all about. And then finally, our multimodal perception is able to
00:16:29.440 | look at emotions, look at the surroundings, what the person's wearing. And we'll also be feeding that into the turn-taking
00:16:36.160 | and the response timing so that we're able to provide much more nuanced conversational
00:16:43.520 | flow. So those things are coming, is my point. And so, this is another example;
00:16:49.360 | I will tear through the last of these because we are already out of time, and that's my fault.
00:16:52.880 | This is another example showing essentially a different way that you can integrate Tavus into
00:16:56.640 | PipeCat. And this is part of the flexibility. As they develop new models, there are going to be things that
00:17:01.120 | will run directly inside Tavus, and there are things over which you want a little bit of control. And so you just drop
00:17:05.360 | them into a slightly differently shaped pipeline and you can get your bot to actually
00:17:09.600 | do what you want to do. I will talk about step three, which is deployment, extremely quickly.
00:17:14.960 | There are a lot of different ways that you can ship these bots. PipeCat is, I sometimes call it,
00:17:20.000 | open source to a fault. I wish it had a few more opinions on some things. Really, what you
00:17:26.000 | need is kind of two pieces. You need some kind of REST API to essentially allow your app,
00:17:31.760 | whatever your client app is, some kind of basic REST API to signal
00:17:37.440 | that a user wants to talk to a bot. And when that happens, you need something to relatively quickly
00:17:42.240 | spin up a new instance of your bot and connect it to that user. And that's essentially what this is
00:17:46.480 | showing here. And then you also need a thing we haven't talked about, again, go read the docs: that's
00:17:51.200 | the transport layer. That's the, hopefully, WebRTC part that actually moves the media back and
00:17:56.720 | forth. That's part of what your infrastructure is configuring. You have a user that wants to use a bot;
00:18:01.360 | you need an API that can start a bot and connect that bot to your user.
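A minimal sketch of that "start a bot on demand" pattern, assuming FastAPI for the REST endpoint and a bot launched as a subprocess. The /connect route, the bot.py script, and the room-URL scheme are hypothetical placeholders for whatever your transport provider actually gives you; they are not PipeCat Cloud's API.

```python
# Hypothetical sketch of the "start a bot on demand" REST API (not PipeCat Cloud's API).
# Assumes FastAPI; the /connect route, bot.py entry point, and room-URL scheme are
# illustrative placeholders.
import subprocess
import uuid

from fastapi import FastAPI

app = FastAPI()


@app.post("/connect")
async def connect():
    # In a real deployment you would create a WebRTC room via your transport
    # provider's API and hand its URL/token to both the bot and the client.
    room_url = f"https://example.daily.co/{uuid.uuid4().hex}"  # placeholder

    # Spin up a fresh bot instance for this user. Production setups typically use
    # a process pool, containers, or a managed service such as PipeCat Cloud instead.
    subprocess.Popen(["python", "bot.py", "--room-url", room_url])

    # The client joins the same room the bot just joined.
    return {"room_url": room_url}
```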
00:18:06.640 | The very short version: if you want to just solve this problem with a little bit of money,
00:18:10.640 | come talk to us at our booth, because PipeCat Cloud is for when you don't want to mess with Kubernetes
00:18:14.800 | and all that kind of stuff, if Kubernetes makes you... We used to have this thing at Heroku where it would
00:18:18.800 | replace Kubernetes with scare quotes around it in the Heroku Slack, which was fun. Come to this talk.
00:18:23.680 | This is Mark, one of my colleagues, talking a lot more about PipeCat Cloud and how we solve the problems of
00:18:30.160 | deploying bots at scale, and how you can either use PipeCat Cloud or, if you want to actually just do
00:18:34.240 | it yourself, this is where you can learn how to do that. And that's our time. Thank you all very much.