Realtime Conversational Video with Pipecat and Tavus — Chad Bailey and Brian Johnson, Daily & Tavus

00:00:00.040 |
We're here to talk about real-time conversational video with Pipecat, that's me, and with Tavus, 00:00:19.320 |
that's Brian. We'll introduce ourselves a little bit more, but in the interest of keeping it moving, 00:00:22.600 |
let's talk about what we're here for. Have any of you ever seen one of these robot concierge things? 00:00:28.080 |
Do they work? No, they don't. They're terrible, right? It's actually possible nowadays to build this kind of thing 00:00:34.380 |
but make it actually good. It's a little bit tricky, but that's what we're here to show you how to do. 00:00:39.360 |
There are three things you need to think about when you want to build real-time AI. The first is your 00:00:46.620 |
models. Obviously, we all know what models are. That's why we're here at this conference. The thing 00:00:51.680 |
that you don't necessarily know you need to think about is your orchestration layer. We're going to 00:00:56.220 |
talk a little bit about that. Then, of course, you need to deploy these bots somewhere. That's the 00:01:00.420 |
third step, deployment. We'll talk about that as well. Step one, models. I come from a little bit 00:01:08.040 |
more of a traditional, funny to say that, voice AI world where the traditional pipeline people talk 00:01:13.560 |
about is speech-to-text, so transcription, and then LLMs for your inference, and then text-to-speech. 00:01:19.620 |
That's the typical kind of cascading pipeline you hear. Sure enough, people nowadays are using some 00:01:24.660 |
voice-to-voice models. That is a use case for this kind of thing, but there are reasons sometimes you 00:01:28.500 |
might use one or the other. Real-time video is a lot more complicated. It doesn't have to be, 00:01:35.760 |
but it can be, and I think Brian will tell you that it should be. There's a lot more stuff you need to 00:01:40.440 |
think about to do video generation in real-time. So, Brian, you want to tell us a little bit 00:01:46.620 |
about Tavus and how you all are thinking about this? Thanks, Chad. So, Tavus started out as an AI 00:01:51.960 |
research company, and we started off with a single model that was like a rendering model. What we 00:01:57.920 |
quickly realized is that we need to be able to put this into a real-time context for it to be useful, 00:02:03.680 |
so it needs to be fast. And once we did that, we started realizing there are a lot of missing pieces, 00:02:09.060 |
things like turn detection, response timing, picking up signals, and orchestration. When we started out, 00:02:16.980 |
we didn't know about Pipecat when we first built it, but we've been partnering with 00:02:21.420 |
Pipecat over the last year, and we've come to realize that a lot of the stuff Pipecat 00:02:26.160 |
does is going to be very important for conversational AI and making it real. I think we can go to the next 00:02:33.660 |
one. Yep. We have a demo. You can go to our site, tavus.io. I was going to do it live, 00:02:38.300 |
but for the sake of time, just go check it out on our website, tavus.io. 00:02:43.220 |
And I'll hand it back. Well, no, there's one more thing. Yep. So what we do at Tavus now is we 00:02:50.220 |
offer a conversational video interface. It is an end-to-end pipeline that allows you to have a 00:02:56.060 |
conversation with a replica of anyone. You can create your own replica of yourself, you can put 00:03:01.180 |
it online, and you can have a conversation. The response time is around 600 milliseconds, but that's not 00:03:06.920 |
ideal because a lot of times that's too fast. So we have to slow that down sometimes based on some of 00:03:11.840 |
these models that we're using. And there are a lot of steps that go into this. You can see 00:03:18.260 |
there are, like Chad talked about, the basic layers of a conversational stack. But we also have 00:03:25.360 |
these proprietary models, Sparrow Zero and Raven Zero, that we've created, which are kind of our 00:03:30.140 |
IP, what we're offering. Right now, we offer those in our stack. But we're 00:03:35.980 |
moving towards a world where we're going to offer those in things like Pipecat. 00:03:38.900 |
So that's models. We'll come back to the Tavus models in a little bit, how they are getting better, and 00:03:46.100 |
some of the cool new things that are coming from Tavus that you will want to use. Orchestration is where 00:03:51.680 |
my world steps in. So that's Pipecat. That's the thing on my water bottle and my shirt and my jacket and 00:03:57.980 |
all that kind of stuff. Let's talk a little bit about what Pipecat is. There's a really 00:04:04.060 |
interesting phrase on Brian's slide: real-time observability and control into the flow of a 00:04:09.060 |
conversation. A lot of those are words that don't really mean anything until you 00:04:16.540 |
actually go build one of these things. And when you build it, the first time you use it, you go, wow, 00:04:21.000 |
this is amazing. This is great. And then as you start to actually think about what it's going to mean 00:04:24.640 |
to have that in production, you realize, oh, wait, there are a lot of boring infrastructure 00:04:30.720 |
kinds of things that we need to solve: the ability to have observability into 00:04:35.720 |
how the bot is behaving and why it's behaving that way, the ability to capture metrics 00:04:39.220 |
on things and understand things like, sometimes the bot takes a long time to respond, I wonder 00:04:43.300 |
why that is. Well, it turns out there are a whole lot of these kinds of things that you need for 00:04:47.920 |
a real live production bot, and that's why you need something like 00:04:52.440 |
Pipecat. Pipecat is an open source framework. It's built by my company, but it is open source 00:04:57.120 |
and actually fully vendor neutral. And it's designed to be this orchestration layer for real time AI. 00:05:03.320 |
And by that, I mean you have a user that is going to be producing either video and/or 00:05:08.400 |
audio, and you want to also be delivering video and/or audio to that user. And you want to do that 00:05:14.120 |
with as low latency as possible. That's the real-time part of this whole conversation. 00:05:17.720 |
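For readers following along, here is a minimal sketch of what that orchestration scaffold looks like in Pipecat's Python API, assuming Daily's WebRTC transport; the import paths and constructor arguments follow Pipecat's public examples and may differ between versions, so treat this as illustrative rather than definitive.

```python
# A minimal Pipecat scaffold: one transport in, one transport out, run asynchronously.
# The room URL and token are placeholders; processors (STT, LLM, TTS, ...) would sit
# between transport.input() and transport.output().
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    transport = DailyTransport(
        "https://example.daily.co/room",  # placeholder room URL
        None,                             # placeholder token
        "demo-bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    pipeline = Pipeline([transport.input(), transport.output()])
    task = PipelineTask(pipeline)
    await PipelineRunner().run(task)


if __name__ == "__main__":
    asyncio.run(main())
```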
If you went to the AI Engineer website and you saw the little button on the bottom right that says 00:05:23.040 |
talk to AIE, that's powered by Pipecat. It's actually using the Gemini Live model, so it's using a 00:05:29.100 |
voice-to-voice model. But there's still so much other stuff you have to do to go from a voice-to-voice demo bot 00:05:35.360 |
in your browser or on the web to an actual shipping production app. Even 00:05:42.080 |
Google themselves, even the Gemini documentation, says you can use their own tools, their 00:05:50.480 |
browser tools and things, to experiment with Gemini Multimodal Live, but when you want to 00:05:56.080 |
take it to production, you do need something like Pipecat to actually orchestrate what's happening in your 00:06:00.800 |
entire app. I'm going to try to do this slide very quickly. And there are a few QR codes coming up, so 00:06:09.120 |
now would be a good time to get those buttons ready. Pipecat itself: there are two lists of three that you need to 00:06:18.400 |
think about to understand what Pipecat does. The first one is something I just kind of already talked 00:06:22.240 |
about. The three things that Pipecat is doing for you are handling input, handling the processing, 00:06:28.560 |
and the output. Input is receiving media from your user. So in the case of a traditional voice bot, 00:06:35.200 |
that's just voice. In the case of a Tavus replica, that's sending voice, and they're even doing 00:06:40.960 |
some interesting things that we'll talk about with inputting your user's video and allowing a Tavus replica 00:06:46.800 |
to respond to not only what it's hearing in the voice, but what it's seeing in the video coming 00:06:51.680 |
from the user. Getting into that, that's the processing part. That's step two. That's where 00:06:57.600 |
essentially you're going to run through a bunch of different models. In some cases, you can do almost 00:07:02.480 |
all of what you need with a single model. In the case of something like Gemini Multimodal Live for voice or a 00:07:07.120 |
Tavus replica, there is a way that you use Tavus inside a Pipecat bot where you can basically let 00:07:12.960 |
Tavus kind of do everything for you and run as just one integrated piece. And then, of course, 00:07:19.040 |
all of those models, since this is supposed to be real-time and interactive video, hopefully 00:07:23.040 |
are producing some kind of output that you want to show to your user. That's the video 00:07:27.360 |
and the audio being produced by your tools. In a typical voice bot, that is 00:07:32.960 |
text-to-speech that is being played out as audio. It might also be things like UI updates, if you're 00:07:38.240 |
in a web app and you're pushing UI updates, that kind of thing. And of course, in the Tavus case, 00:07:41.840 |
it's video and audio that are hopefully presented in a way where the video stays synchronized to the 00:07:46.640 |
audio, for example. That's a really, really hard thing to do well, depending on exactly how you build 00:07:50.800 |
this whole thing. The three fundamental pieces of Pipecat that enable those things to work 00:07:57.440 |
are frames, processors, and pipelines. Pipecat's name comes from the fact that it is about building 00:08:02.800 |
a pipeline, and a pipeline is composed of processors. Processors are things that handle 00:08:10.640 |
frames. A frame is basically a typed container for a kind of data. So, in a Pipecat 00:08:18.640 |
pipeline, you will see a whole bunch of frames with things like little snippets of user audio, 00:08:23.520 |
like 10 or 20 milliseconds of audio comes across as an audio frame, or video frames from the user's camera 00:08:29.760 |
that you can capture. But even things like voice activity detection, VAD, come across as a 00:08:34.960 |
user-started-speaking frame in Pipecat. All of those frames progress through a series of processors, and a 00:08:40.160 |
processor just takes in some frames and outputs other frames. So, a good example would be the LLM 00:08:47.680 |
processor, which is taking in frames that are essentially context frames, like completed context 00:08:54.560 |
turns from the user and the bot, and it is outputting a stream of text frames. So, if you're capturing 00:09:00.560 |
streaming output from your LLM, in Pipecat that looks like a bunch of text frames coming out of that 00:09:06.400 |
processor. And all of those are put together in a pipeline, and the pipeline is how you describe what 00:09:11.760 |
you want your bot to do. And the idea behind how Pipecat runs your pipeline is that it's doing all of 00:09:16.720 |
that stuff asynchronously and doing its best to minimize the latency of every piece of information 00:09:22.720 |
as it goes through the pipeline. So, there is a much better and longer explanation. I know that 00:09:28.640 |
that was a lot. There's a much better and longer explanation in the Pipecat docs, which is that QR code. 00:09:34.560 |
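As a concrete illustration of the frame/processor idea, here is a small custom processor sketch; the FrameProcessor and TextFrame names match Pipecat's public API at the time of writing, but check the docs for current signatures.

```python
# A pass-through processor: it forwards every frame downstream unchanged and
# logs the text chunks streamed out of the LLM.
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TextLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)  # required base-class bookkeeping
        if isinstance(frame, TextFrame):
            print(f"LLM text chunk: {frame.text!r}")
        await self.push_frame(frame, direction)  # always pass the frame along
```

Dropped anywhere in a pipeline, a processor like this sees every frame that flows past it, which is also the mechanism behind the observability and metrics hooks mentioned above.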
In terms of what it actually looks like, it was going to be a little tight to try to get in and do some live 00:09:39.760 |
coding in 15 minutes. But this is a QR code that links to this example file. There's so much stuff 00:09:46.800 |
in the Pipecat repo that shows you this. But just to step through these pieces real quick: at the top 00:09:51.760 |
there's the transport input. This is the core pipeline inside this bot file, and this is actually one of 00:09:57.360 |
the Tavus examples that we have in the repo. First thing is transport input. That's where the frames come 00:10:03.040 |
in from your media transport. So, whether it's WebRTC or WebSockets or Twilio WebSockets or anything 00:10:09.040 |
like that, frames start pouring in from the transport input. They go to a speech-to-text processor. That's 00:10:14.560 |
where transcription is happening. So, for example, one thing that frame processor is doing is it's 00:10:19.760 |
collecting snippets of audio, a frame at a time, 20 milliseconds at a time. But it is sort of 00:10:25.200 |
up to your transcription processor, whatever that is, Deepgram or Whisper running on something or whatever, 00:10:30.560 |
to collect however many frames it needs to then output a 00:10:35.680 |
snippet of transcription information, right? So that happens in speech-to-text. From there we go into 00:10:40.160 |
something called the context aggregator. That's because the transcription or the STT processor is 00:10:46.000 |
emitting transcriptions whenever it feels like it. So we use other frames in the pipeline that have made 00:10:51.520 |
their way through to understand, okay, the user has started talking; the user's 00:10:56.480 |
microphone level has dropped, so it looks like the user has stopped talking. 00:10:59.760 |
Maybe now is a good time to group all of the various pieces of transcription we've gotten over 00:11:03.840 |
the past few seconds together and emit a single context aggregation frame. That's what triggers 00:11:09.360 |
the LLM to run. And so we grab the context. Of course, if you've programmed with 00:11:14.640 |
LLMs, you know you get the context with the array of messages and the tools and everything. 00:11:18.480 |
You show that to the LLM and then it starts streaming tokens back. Those tokens come out of the LLM 00:11:23.280 |
as text frames, as well as a start and end frame. And if you're familiar with this 00:11:27.840 |
approach, you can probably see all these other frames as they start to exist in here. But then TTS 00:11:32.960 |
essentially accumulates those and generates speech. This bot file is actually an older example that uses an 00:11:41.040 |
older Tavus model where we were actually generating audio. And then we were sending the audio over, I believe, 00:11:47.520 |
a WebSocket, it's not important. We were sending audio to a Tavus model that was generating the video 00:11:54.000 |
based on the audio and then sending back to us, back to Pipecat, audio and video. So essentially the same 00:11:59.840 |
audio, but synchronized with the video. Those come as a different series of frames that then go out the 00:12:04.240 |
transport output. And that is, again, essentially the same transport that we're using on the 00:12:09.600 |
input side, but this is the output side. And so that's where all that media goes back to the other user. 00:12:14.320 |
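Put together, the pipeline just described looks roughly like the following; the services are illustrative stand-ins (for example a Deepgram STT service, an OpenAI-compatible LLM, a TTS service, and Pipecat's Tavus integration), constructed elsewhere and passed in, and the exact classes in the repo example may differ.

```python
from pipecat.pipeline.pipeline import Pipeline


def build_pipeline(transport, stt, context_aggregator, llm, tts, tavus) -> Pipeline:
    # Frames flow top to bottom through this list.
    return Pipeline([
        transport.input(),               # media frames arrive from WebRTC / WebSockets
        stt,                             # speech-to-text: audio frames -> transcription frames
        context_aggregator.user(),       # groups transcriptions into a completed user turn
        llm,                             # context frames in, streamed text frames out
        tts,                             # text frames -> audio frames
        tavus,                           # audio -> synchronized audio + video from the replica
        transport.output(),              # media frames go back out to the user
        context_aggregator.assistant(),  # records the bot's reply into the context
    ])
```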
So you can start to see how with this structure, it looks very simple right here, 00:12:20.640 |
but it is incredibly powerful when you realize that you can put almost anything you want in this 00:12:24.480 |
pipeline. We have people, for example, using a construct in Pipecat called parallel 00:12:29.280 |
pipelines. So we have people that have this exact same workflow, but at the same time, in real time, 00:12:34.240 |
they're running another LLM that is doing things like sentiment analysis. Or there's 00:12:40.880 |
one Pipecat user I talked to that is using Gemini Live multimodal to 00:12:46.960 |
detect if the person answering the phone is a person or if it's a voicemail greeting, 00:12:52.960 |
and they have separate pipelines running for whether it's a voicemail 00:12:56.880 |
or whether it's a human. All that happens in Pipecat through the use of a parallel pipeline: 00:13:00.480 |
run one model to determine which it is, and then it sends a signal back to the pipeline to say, 00:13:04.800 |
do the voicemail branch or do the human branch. So you can start to get an idea of what you 00:13:10.880 |
can build, even if you have a model like Tavus that is doing 90% of the hard work of making the actual 00:13:17.440 |
interaction feel good. There's just enough other stuff that's going to happen around the periphery 00:13:22.560 |
that it just makes a lot of sense to wrap what you're doing inside something like Pipecat. 00:13:28.160 |
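A hedged sketch of that parallel shape, using Pipecat's ParallelPipeline construct (import path per recent Pipecat versions); the classifier and the branch processors here are hypothetical placeholders for whatever models you run on each path.

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline


def build_parallel_pipeline(transport, classifier_llm, stt, context_aggregator, llm, tts) -> Pipeline:
    # Both branches receive the same input frames; their outputs merge downstream.
    return Pipeline([
        transport.input(),
        ParallelPipeline(
            [classifier_llm],                            # e.g. detect voicemail vs. human
            [stt, context_aggregator.user(), llm, tts],  # the normal conversational path
        ),
        transport.output(),
    ])
```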
So Brian showed a picture; this is that same Tavus avatar. If you go to the QR code 00:13:35.280 |
on the last slide, which is going to come up again in a second, you can basically run that 00:13:41.600 |
example. You just need to sign up for Tavus, you get a key, you drop the key in there, you run that 00:13:45.440 |
example code unmodified, and it will pop up this UI where you can both talk to that avatar in real time, 00:13:51.680 |
talk to the replica in real time, but also see some of the interesting guts of what's 00:13:56.080 |
happening inside Pipecat in that debug panel over there. Do you want to tell us a little bit about 00:14:00.960 |
why this architecture is interesting and what we can do in the near future with it? 00:14:05.920 |
Yeah. So as I mentioned, when we first built Tavus's conversational video interface, 00:14:10.720 |
we built it ourselves because we didn't know about Pipecat. So we've spent the last year learning a lot of 00:14:15.680 |
the lessons that Pipecat has already solved. There are a ton of orchestration, aggregation, and 00:14:24.640 |
communication functionalities that are in Pipecat already that are going to basically save you 00:14:30.720 |
months of time. I mean, it's going to save you a lot of time. So when we first talked about having 00:14:36.960 |
this talk, I was like, we're not using Pipecat internally; I can't really say we're 00:14:42.320 |
using it internally. But the thing is, our customers that have come to us that are enterprise customers, 00:14:46.960 |
they're using Pipecat and they want to be able to use our stuff in Pipecat. So now we're getting 00:14:52.880 |
ready to move our best models into Pipecat. We've already moved Phoenix, which is our rendering model, 00:14:58.240 |
but we're also going to be moving turn-taking, response timing, and perception models, things like that. 00:15:04.480 |
And eventually we're probably going to actually bring Pipecat internally as well, because I spent 00:15:10.720 |
the last couple of days actually debugging a problem that Pipecat has already solved really well. 00:15:16.240 |
Yeah. So I talked about these models that are coming. We have a couple of different, 00:15:24.720 |
unique models. Our turn detection model is a multilingual model that determines when a person 00:15:31.360 |
is done speaking. You wouldn't believe how important that is in a conversational AI. 00:15:36.240 |
It's going to make your AI faster, and it's going to make it so it doesn't talk over people. 00:15:42.960 |
If you have a very fast conversational pipeline, oftentimes it will 00:15:50.800 |
actually talk over the user. And if you have a slow one, it will take so long to respond that 00:15:56.400 |
people will be like, is it broken? You want to get the best of both worlds. And that's what turn 00:15:59.920 |
detection does. We're also working on a response timing model right now, and 00:16:05.520 |
we're bringing all of these to Pipecat soon. That response timing model will determine how quickly the bot 00:16:10.240 |
should respond once the person is done. Because if I'm telling you about my grandmother 00:16:15.600 |
who's going into a home and she's sad, you're not going to want to quickly 00:16:20.640 |
respond to that. You want to think and take your time, right? But if we're having a chit-chat, you 00:16:24.400 |
want to be fast. So that's what that's all about. And then finally, our multimodal perception model is able to 00:16:29.440 |
look at emotions, look at the surroundings, what the person's wearing. And we'll also be feeding that into the turn-taking 00:16:36.160 |
and the response timing so that we're able to provide much more nuanced conversational 00:16:43.520 |
flow. So those things are coming; that's my point. And so this is another example. 00:16:49.360 |
I will tear through the last of these because we are already out of time, and that's my fault. 00:16:52.880 |
This is another example showing essentially a different way that you can integrate Tavus into 00:16:56.640 |
Pipecat. And this is part of the flexibility. As they develop new models, there are going to be things that 00:17:01.120 |
will run directly inside Tavus and things over which you want to have a little bit of control. And so you just drop 00:17:05.360 |
them into a slightly differently shaped pipeline and you can get your bot to actually 00:17:09.600 |
do what you want it to do. I will talk about step three, which is deployment, extremely quickly. 00:17:14.960 |
There are a lot of different ways that you can ship these bots. Pipecat is, sometimes I call it, 00:17:20.000 |
open source to a fault; I wish it had a few more opinions on some things. Really, what you 00:17:26.000 |
need is two pieces. You need some kind of REST API to allow your app, 00:17:31.760 |
whatever your client app is, some kind of basic REST API for it to say 00:17:37.440 |
that a user wants to talk to a bot. And when that happens, you need something to relatively quickly 00:17:42.240 |
spin up a new instance of your bot and connect it to that user. That's essentially what's 00:17:46.480 |
being shown here. And then you also need a thing we haven't talked about (again, go read the docs): 00:17:51.200 |
the transport layer. That's the, hopefully, WebRTC part that actually moves the media back and 00:17:56.720 |
forth. That's part of what your infrastructure is configuring. You have a user that wants to use a bot; 00:18:01.360 |
you need an API that can start a bot and connect that bot to your user. 00:18:06.640 |
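As a rough sketch of that REST piece, here is one way it could look; create_room() is a hypothetical placeholder for whatever your transport provider's API (for example Daily's REST API) uses to create a room and token, and launching bot.py as a subprocess stands in for however you actually spin up bot instances.

```python
# A tiny "start a bot" endpoint: the client calls POST /start, we create a room,
# launch one bot process for this session, and return the room for the client to join.
import subprocess

from fastapi import FastAPI

app = FastAPI()


async def create_room() -> tuple[str, str]:
    """Placeholder: call your transport provider and return (room_url, token)."""
    raise NotImplementedError


@app.post("/start")
async def start_bot():
    room_url, token = await create_room()
    # One bot instance per user; in production this might be a container,
    # a process pool, or a managed service like Pipecat Cloud.
    subprocess.Popen(["python", "bot.py", "-u", room_url, "-t", token])
    return {"room_url": room_url}
```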
The very short version: if you want to just solve this problem with a little bit of money, 00:18:10.640 |
come talk to us at our booth, because Pipecat Cloud is for you if you don't want to mess with Kubernetes 00:18:14.800 |
and all that kind of stuff. If Kubernetes scares you... we used to have this thing in the Heroku Slack where it would 00:18:18.800 |
replace Kubernetes with the same word in scare quotes, which was fun. Come to this talk. 00:18:23.680 |
This is Mark, one of my colleagues, talking a lot more about Pipecat Cloud and how we solve the problems of 00:18:30.160 |
deploying bots at scale, and how you can either use Pipecat Cloud or, if you want to actually just do 00:18:34.240 |
it yourself, this is where you can learn how to do that. And that's our time. Thank you all very much.