Full Workshop: Realtime Voice AI — Mark Backman, Daily

00:00:00.000 |
I'm Mark with Daily. This is Alesh. And then we have a few other Daily folks: 00:00:20.240 |
Quinn, Nina, Varun, and then, I'm not sure where he went, but Philip, right there, 00:00:27.760 |
from the Google DeepMind team. So this session we're going to spend just a 00:00:32.320 |
few minutes getting everyone started. The idea here is going to be a hands-on 00:00:35.600 |
workshop where all the folks I just called out are going to be available to 00:00:39.820 |
help out. We'll walk you through a quick start to get you up and running and then 00:00:44.140 |
the idea is to build something. So build a voice bot in the next 78 minutes and 12 00:00:49.300 |
seconds or whatever time we have left. The one, I guess there's one consideration 00:00:55.520 |
is the Wi-Fi. If you don't have good Wi-Fi you might want to try to tether. I was 00:01:00.240 |
able to tether and it worked fairly well but the conference Wi-Fi was a little 00:01:03.280 |
shaky. This is real-time so you will be streaming data. It does require a viable 00:01:08.900 |
connection and not just sending a few bits over. So just a heads up if you hit 00:01:13.040 |
that as a snag. So I guess before I get started, who here knows about PipeCat or 00:01:19.760 |
has built anything with voice AI? Okay. A smaller audience. Has anyone built any real-time 00:01:27.800 |
applications with LLMs or AI? Maybe slightly bigger? Okay. Great. So PipeCat is an open-source repo: a Python framework for building 00:01:31.040 |
voice AI and multimodal agents. It's built by the team at Daily, but it's an open-source project that anyone can contribute to. 00:01:49.760 |
It's been around for, I don't know, just over a year now? Yeah, like I would say officially PipeCat was March 2024, something like that. Okay. So 13 months, there we go. So just a quick walk through, maybe just to kind of ground everyone in the thinking around voice AI. These slides weren't built for this talk, but I'm going to use them. 00:02:13.760 |
So, you know, voice AI and real-time applications are tough because we as humans communicate with each other all the time, with thousands, tens of thousands of years of evolution baked into our brains. So it's pretty tough to make a machine work on the same level, and as users we have high expectations. You need a good listener, something that's smart and conversational. You need to be connected to data stores. 00:02:38.760 |
It has to sound natural. Think back to even just two or three years ago what voice bots sounded like, and many of them, if you call them on the phone, still sound that way; it needs to sound natural. And actually, kudos to the Google team, the latest Gemini Live native audio dialogue is quite good in that regard. It also has to be fast, so the whole end-to-end exchange needs to happen quickly; roughly, the benchmark is around 800 milliseconds. 00:03:07.760 |
You could strive for better. I think on the human level it might be 500 milliseconds or somewhere on that order, so it is pretty fast. So there's a lot to do to get all the way there. And this is something that we at Daily, everyone building PipeCat, have been working very, very hard on: getting all the way to meeting all these expectations. 00:03:28.760 |
So just to kind of ground you in some of this, since we're going to be working in PipeCat. PipeCat has a pipeline. 00:03:33.760 |
I don't know if maybe, Alesh, you want to talk a little bit about the origin of that quickly. Sure, sure. You can think about it as a multimedia pipeline. And what is a multimedia pipeline? Basically, just think about boxes that receive input, and the input could be audio or video. And then those boxes will stream that same data, or modified data, or new data, 00:04:00.760 |
to the following elements or processors in PipeCat. Well, we call them processors. So in PipeCat, you would have a pipeline where you have a transport, which is 00:04:12.760 |
the transport of your data, or the input of your data. For example, when you're talking, you could be talking in a meeting, so that would be the audio of the user. Then you would have another box following that, which is the speech-to-text service. The speech-to-text service receives the audio of the user. 00:04:22.760 |
It would transcribe it, then you would get text. That would be the following data that goes through the pipeline. And then the next one would be the LLM. So now the LLM has what the user has said. 00:04:34.760 |
And then it generates output, whatever the LLM produces, which would be tokens; those tokens are converted by text-to-speech. And then the text-to-speech outputs audio, and the audio goes back to the transport so you can hear what the LLM has said. 00:04:50.760 |
What we're going to do today with Gemini Live, a lot of those boxes go away, because the LLM will do a bunch of these things. It will do the transcription, the LLM, and the text-to-speech in one box. 00:05:11.760 |
But you might still require, for example, if you want to save the audio, to record the audio into a file; you need a bunch of utilities to do that. And PipeCat has all that built for you. So basically, that's it. 00:05:32.760 |
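To make that concrete, here is a minimal sketch of the cascaded pipeline being described, based on the public PipeCat examples. Exact import paths and parameter names vary by PipeCat version, the Deepgram/OpenAI/Cartesia services are just stand-ins, and the transport and context aggregator are assumed to be set up as shown later in the walkthrough.

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia import CartesiaTTSService    # text-to-speech
from pipecat.services.deepgram import DeepgramSTTService    # speech-to-text
from pipecat.services.openai import OpenAILLMService        # text-based LLM

# Stand-in services; any supported vendor can be swapped in for each box.
stt = DeepgramSTTService(api_key="DEEPGRAM_API_KEY")
llm = OpenAILLMService(api_key="OPENAI_API_KEY", model="gpt-4o")
tts = CartesiaTTSService(api_key="CARTESIA_API_KEY", voice_id="VOICE_ID")

# `transport` and `context_aggregator` are created elsewhere (see below).
pipeline = Pipeline([
    transport.input(),               # audio in from the user
    stt,                             # user audio -> transcribed text
    context_aggregator.user(),       # add the user's turn to the conversation context
    llm,                             # context -> response tokens
    tts,                             # response text -> synthesized audio
    transport.output(),              # bot audio back out to the user
    context_aggregator.assistant(),  # add the bot's turn to the context
])
```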
Yeah. I mean, a lot of this is really just orchestration. So if you think about what PipeCat offers, it's orchestration. It also offers a lot of abstractions for a lot of common utilities, like Alesh said. 00:05:44.760 |
So recording, transcript outputs, artifacts you might want to produce, or even ways that you might manipulate the information in the pipeline itself. So this image here, which is what Alesh actually just talked through, is what you would call, I guess, a cascaded model, where you have this flow through of information. 00:06:01.760 |
You can build with Google and many different services in this way. In the last year, there's been an emergence of speech-to-speech models that now take audio in natively and produce audio out natively. 00:06:12.760 |
And those models also allow for audio in, and then optionally, text and/or audio out. So you can actually, for example, take a raw, you know, microphone input, or, you know, audio input, and then the model would run all of its logic. 00:06:30.760 |
So you can actually opt to have it output text, if you want to say, parse the text output before speaking. So there are a few different demos we'll look at that offer that. 00:06:38.760 |
And in PipeCat, we show kind of all the ways to do things, because that's what we offer, or at least what PipeCat offers as a value proposition. So the-- 00:06:47.760 |
Yeah, just one thing. I don't think you'll mention it in the slides, but all these boxes, you can plug and play the service you want in PipeCat. So for speech to text it could be, I don't know, Deepgram, for example; the LLM could be Google, or OpenAI, or whatever. You can just plug and play any service you want. 00:07:07.760 |
Right, yeah. The modularity, I guess, is the other big strength. You can change out a service without changing out your underlying application code, which makes it easy. 00:07:15.760 |
And we see a lot of companies that are building for voice AI might have maybe an even more complex thing. A pipeline here runs straight down, but you can actually have split branches, where you might have one leg running some logic and the other running different logic. We call that a parallel pipeline. 00:07:31.760 |
So if you wanted to have, say, a failover, if vendor A goes down, you can move to vendor B dynamically, even within the same conversation. That's something that pipecat affords as well. And that can allow you to transfer context over. So a lot of really cool stuff. 00:07:47.760 |
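As a rough sketch of what a parallel pipeline looks like structurally: PipeCat has a ParallelPipeline processor, and each positional argument is one branch. The branch contents below are placeholders, and real failover or routing logic would gate each branch with filter processors, which is beyond this sketch.

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline

# Each positional argument is one branch (a list of processors).
# Frames flow into both branches; logic inside a branch decides what it handles.
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    ParallelPipeline(
        [llm_vendor_a],   # e.g. the primary LLM
        [llm_vendor_b],   # e.g. a fallback vendor, or a branch running different logic
    ),
    tts,
    transport.output(),
    context_aggregator.assistant(),
])
```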
The goal again today being to get you familiar with building a voice agent and getting one started. So one of the cool things, like Alesh had pointed out, is that 00:08:00.760 |
with the cascaded models, there's a lot of complexity, but with your speech to speech model, things get dramatically simplified. 00:08:07.760 |
You know, your code may have looked something like this. This is an old example with a ton of orchestration in the pipeline. But with a speech-to-speech model, you may be able to simplify it down to this. Then you have to remember you actually need orchestration around it, so it does get simpler in some regards. It's more about the services you interface with. 00:08:29.760 |
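With a speech-to-speech model, the cascaded pipeline above collapses roughly to this (again a sketch; the Gemini Live service class and its import path come from the PipeCat examples and may differ slightly by version):

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

# One service handles transcription, the LLM, and speech generation.
llm = GeminiMultimodalLiveLLMService(api_key="GOOGLE_API_KEY")

pipeline = Pipeline([
    transport.input(),               # raw microphone audio in
    context_aggregator.user(),
    llm,                             # audio in -> audio (and optionally text) out
    transport.output(),
    context_aggregator.assistant(),
])
```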
I think with that, why don't we transition now, because I'm realizing we have only about maybe 70, 75 minutes left, to looking at the actual activity for today. 00:08:40.760 |
All right, so there's a public repo. I don't know how big or small this is, but it's under daily-co. So on GitHub it's daily-co/gemini-pipecat-workshop. 00:08:53.760 |
I'll give everyone a chance to make sure the internet's working. 00:09:31.760 |
So what I want to do, so Alesh and I spent a little bit of time writing up this repo. 00:09:36.760 |
This is meant to be just a jumping off point. 00:09:38.760 |
I'm going to get you oriented and then I want to look through one of the bot files, which is kind of the main pipecat code with you. 00:09:45.760 |
And then we'll break and make this an interactive session where we can answer a bunch of questions. 00:09:50.760 |
So in the repo, you could either start doing it now or maybe take a pause, but this will give you the steps to walk through getting the quick start running. 00:10:00.760 |
Before we do that, I want to take a moment here. 00:10:12.760 |
Well, hey, you know, it is tough to do Wi-Fi for this many people. 00:10:19.760 |
Instead of real time, it's going to be real slow. 00:10:38.760 |
There will be some client code options, which we'll look at in a second. 00:10:43.760 |
Just to orient you, I'll just jump right into the meat of the pipeline. 00:10:47.760 |
We have this main function that runs your bot. 00:10:52.760 |
Everything is going to run kind of encapsulated within an aiohttp session. 00:10:58.760 |
That's more of just kind of the mechanics of things. 00:11:00.760 |
In our pipeline, let's just jump to the simple part here. 00:11:04.760 |
We'll have Daily as the transport. Daily is a WebRTC provider, as well as the team that builds PipeCat. 00:11:13.760 |
So one important note, when you speak, every turn of the bot is like a discrete point in time. 00:11:21.760 |
And this is maybe less so the case for a speech-to-speech model. 00:11:24.760 |
But for basic LLMs, they get discrete inputs. 00:11:29.760 |
So you're going to get a snapshot of the conversation. 00:11:31.760 |
The context aggregator is going to collect all the bits of the conversation, both from the user and the assistant. 00:11:39.760 |
And it'll put that into a form the LLM can handle. 00:11:43.760 |
So this, in this case, is more for function calling and kind of logistics and management. 00:11:50.760 |
Gemini is amazing because it offers a lot of this for you. 00:11:53.760 |
But if you were to build with, say, just Gemini, not Live, the actual text-based LLM, you'd handle more of that yourself. 00:12:03.760 |
That will then go to your LLM, which is going to be Gemini Live, Gemini Multimodal Live. 00:12:08.760 |
And then it's going to be outputted through daily again on that side of the transport. 00:12:16.760 |
We'd configure our service, which takes a number of arguments. 00:12:19.760 |
Like you set up a room with a token, give it a name, and then have some properties. 00:12:29.760 |
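A sketch of that transport setup, following the shape used in PipeCat's Daily examples (parameter names vary a bit across PipeCat versions):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# room_url and token come from your Daily account (or its REST API).
transport = DailyTransport(
    room_url,                              # URL of the Daily room the bot joins
    token,                                 # meeting token for that room
    "Gemini Bot",                          # the bot's display name in the room
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # on-device voice activity detection
    ),
)
```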
There's also a Gemini Multimodal Live LLM service, which is a PipeCat class that is a wrapper around the Gemini Live API. 00:12:42.760 |
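Instantiating it looks roughly like this; the voice_id value and exact keyword names follow the PipeCat Gemini Live examples and may differ by version, and the tools object is the function schema described next:

```python
import os

from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",                      # one of the Gemini Live voices
    system_instruction="You are a helpful assistant who can answer questions and use tools.",
    tools=tools,                          # function definitions, shown below
)
```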
With the LLM, you see we do a few special things. 00:12:47.760 |
This one has just two really basic kind of canned functions. 00:12:50.760 |
Fortunately, we're not calling out to the internet because it's not working very well. 00:12:56.760 |
So this one just has the dummy, like fetch weather, and we'll give you a restaurant recommendation. 00:13:00.760 |
So these are two handlers that when your function is called, we'll just return this result information. 00:13:07.760 |
So we have the actual functions themselves that are defined in this function schema, which is -- you can use just native function definitions using whatever LLM format. 00:13:19.760 |
We also created this function schema, which is a universal schema that lets you define and move between any LLM without having to kind of transform your LLM calls from OpenAI to Anthropic to Gemini to Bedrock. 00:13:33.760 |
Because they're all a little bit different. Or Groq. 00:13:35.760 |
You know, they all have slightly different formats. 00:13:37.760 |
So this is more of kind of a universal transform for that. 00:13:40.760 |
And then they're collected and translated into the native format in this tool schema. 00:13:47.760 |
So we'll pass the tools then to the Gemini service, and that's how it gets access to use and run those tools. 00:13:54.760 |
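Sketched out, the canned weather function and its handler look something like this. FunctionSchema and ToolsSchema are PipeCat's provider-agnostic classes; the handler signature has changed across PipeCat versions, so treat the single-params form as an assumption:

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema

# Provider-agnostic definition; PipeCat translates it into the native
# Gemini (or OpenAI, Anthropic, Bedrock, ...) tool format for you.
weather_function = FunctionSchema(
    name="get_weather",
    description="Get the current weather for a city.",
    properties={"city": {"type": "string", "description": "The city name"}},
    required=["city"],
)
tools = ToolsSchema(standard_tools=[weather_function])

# Canned handler: no network call, just return a dummy result when the model
# invokes the function. (Older PipeCat versions pass several positional
# arguments here instead of one params object.)
async def fetch_weather(params):
    await params.result_callback({"conditions": "sunny", "temperature_f": 75})

llm.register_function("get_weather", fetch_weather)
```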
There's also a prompt above, which I think in this simple example, we just say, hey, you're a chat bot. 00:13:59.760 |
You have these tools available, and that's that. 00:14:02.760 |
We're also setting up our context aggregation, which, for better or worse, we use OpenAI as kind of the default, like the lingua franca for context. 00:14:13.760 |
So everything gets kind of folded back into OpenAI at a certain level. 00:14:20.760 |
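In code, that looks roughly like the following: the context object stores messages in OpenAI's format no matter which LLM runs, and the aggregator pair goes into the pipeline (names as in recent PipeCat examples):

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful voice assistant."}],
    tools=tools,
)
context_aggregator = llm.create_context_aggregator(context)
# context_aggregator.user() and context_aggregator.assistant() are the two
# processors that collect user and assistant turns inside the pipeline.
```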
So the pipeline is, again, like Elesh had said, just a list of all of your different -- or I guess a tuple of all of your different services that are running in the pipeline. 00:14:32.760 |
So if you want to instead make the LLM output text, and you want to either extract information from it, maybe have it encode some XML or some type of structured information, 00:14:44.760 |
Maybe your application does something with it. 00:14:47.760 |
Or maybe you want to inject a text-to-speech type of frame. 00:14:52.760 |
You can actually do that by switching the LLM's output from audio to just text. 00:14:59.760 |
And then you would add, like, a text-to-speech service here. 00:15:02.760 |
Or you could write your own processor, though maybe not within the next hour. 00:15:05.760 |
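Writing your own processor is mostly subclassing FrameProcessor; here is a hypothetical sketch that just watches text frames coming out of the LLM before they reach TTS (class name and the inspection logic are made up for illustration):

```python
from pipecat.frames.frames import TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TextInspector(FrameProcessor):
    """Hypothetical processor: peek at LLM text before it reaches TTS."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            # Parse XML, run checks, or trigger application logic here.
            print(f"LLM text: {frame.text}")
        await self.push_frame(frame, direction)  # always pass frames along
```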
And then lastly, all of these -- or not all, but many of them have events they emit. 00:15:10.760 |
The transport emits handlers for when the client connects and disconnects. 00:15:15.760 |
So in this case, we use this line here to actually inject a context frame into Gemini to kick off the conversation. 00:15:24.760 |
So when your client application connects, it's going to queue a frame. 00:15:28.760 |
So again, frames being kind of the base format for information. 00:15:33.760 |
Think of it as like an object for your pipeline. 00:15:37.760 |
And what this function does is it just grabs the latest context. 00:15:41.760 |
So when we set this one above, I think that just says hello, that's going to pass that into -- 00:15:47.760 |
basically push that into the pipeline, which then it will make its way to Gemini to initialize the conversation. 00:15:53.760 |
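The handler itself is only a few lines; depending on the transport and PipeCat version the event is on_client_connected or on_first_participant_joined, so treat the exact name as an assumption:

```python
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Push the current context into the pipeline so Gemini speaks first.
    await task.queue_frames([context_aggregator.user().get_context_frame()])
```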
And the rest of this, you could think of as boilerplate to run it. 00:15:56.760 |
You create a runner, which then is the thing that actually runs your task. 00:16:02.760 |
So maybe beyond today, but just know that that's something that's required to run your code. 00:16:08.760 |
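That boilerplate, sketched out (this lives inside an async main function, with the usual caveat that names can shift between PipeCat versions):

```python
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
runner = PipelineRunner()
await runner.run(task)  # blocks until the conversation ends
```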
I'm going to pause here just for any questions, because I do want to get to developing soon. 00:16:12.760 |
[Audience question, partly inaudible, about whether a web app client should use WebSockets or WebRTC.] 00:16:24.760 |
The short answer is if you're building a client-server app, you should, with strong emphasis, use WebRTC. 00:16:31.760 |
It has a whole bunch of properties that are relevant, like error correction, better audio quality, et cetera, et cetera. 00:16:38.760 |
If you're building server-to-server, so one of the options today is to build a phone chatbot, you can use WebSockets, 00:16:43.760 |
and that's probably the best option. 00:16:48.760 |
In PipeCat, there's actually a FastAPI version of that, a server that you can use to exchange messages over a WebSocket. 00:16:57.760 |
So I guess maybe the takeaway is if you're building client-server, you really want WebRTC, you could technically use WebSockets, but you'll hit like a long tail of errors when you get to production. 00:17:14.760 |
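For the server-to-server / telephony case, here is a sketch based on the Twilio chatbot example in the PipeCat repo; the websocket and stream_sid values and the exact parameter names are assumptions that depend on your FastAPI handler and PipeCat version:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

# `websocket` is the FastAPI WebSocket that Twilio's Media Streams connect to;
# `stream_sid` comes from Twilio's initial "start" message.
transport = FastAPIWebsocketTransport(
    websocket=websocket,
    params=FastAPIWebsocketParams(
        audio_out_enabled=True,
        add_wav_header=False,
        vad_analyzer=SileroVADAnalyzer(),
        serializer=TwilioFrameSerializer(stream_sid=stream_sid),
    ),
)
```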
Could anyone download the repo by any chance? 00:17:37.760 |
You know, I'm not really -- you know, it's not my thing, but I could try. 00:17:41.760 |
Yeah, I guess I could just make this code walkthrough. 00:17:57.760 |
In fact, if you're using a speech-to-text service, that likely also brings its own VAD to the equation. 00:18:03.760 |
What we've found is that -- so maybe to use some of our extra time because we're having internet issues -- 00:18:08.760 |
the VAD serves a really important purpose of detecting when a user starts speaking. 00:18:13.760 |
So in the whole kind of life cycle of a turn, that user speaking kind of ushers in the user's turn for the conversation. 00:18:20.760 |
So PipeCat will emit a user started speaking frame, and that will also push through an interruption. 00:18:26.760 |
So the user will interrupt, like, anything that's talking to -- if the bot was speaking or whatever. 00:18:31.760 |
It basically clears the way because the user has expressed that they want to speak. 00:18:36.760 |
The idea with the VAD is that we want it to be extremely accurate and extremely fast. 00:18:41.760 |
So running something on-device -- we recommend Silero, which is an open-source option. 00:18:47.760 |
I don't know exactly what the inference time is -- a millisecond or so -- extremely fast. 00:18:55.760 |
In fact, you have the ability to tune how long to hear human speech before that event gets emitted. 00:19:01.760 |
And the defaults are pretty good in PipeCat, and there are maybe scenarios where you want to change that. 00:19:05.760 |
But the VAD is a really important consideration. 00:19:11.760 |
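Tuning how long the VAD listens before emitting that event is one parameter on the analyzer; a sketch, with field names taken from recent PipeCat versions:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# stop_secs is how much silence the VAD waits for before ending the user's turn:
# shorter means snappier responses, longer means fewer accidental interruptions.
vad_analyzer = SileroVADAnalyzer(params=VADParams(stop_secs=0.8))
```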
Quinn has a great spreadsheet of breaking down the full cost analysis of an agent. 00:19:17.760 |
And really, the CPU is going to be extremely low. 00:19:19.760 |
Interestingly, the TTS tokens or characters are the most expensive by far. 00:19:24.760 |
So when you think about it, running that local VAD gives you superior performance. 00:19:29.760 |
And it allows you -- and there's not much of a cost to it. 00:19:32.760 |
I mean, it's maybe like a fraction of 1% to run a local VAD. 00:19:49.760 |
So there are -- I found out -- I didn't work much with phone. 00:19:56.760 |
I don't -- Varun is like a figure in the WebRTC community. 00:20:01.760 |
But he's also a phone expert, which I don't know if that crosses over too much. 00:20:07.760 |
Phones are super complicated, how you actually make calls. 00:20:17.760 |
You can make a WebSocket connection with a phone provider like a Twilio 00:20:21.760 |
or Telnyx or Plivo or Exotel and exchange media streams. 00:20:27.760 |
So that's a way to have just a native WebSocket connection from PipeCat to Twilio. 00:20:33.760 |
It's going to emit the WebSocket and there'll be a handshake to get connected. 00:20:37.760 |
You can also use PSTN, the public switched telephone network, which lets you dial in. 00:20:45.760 |
And that's going to be kind of a different mechanism. 00:20:47.760 |
There's also SIP, which is its own separate thing. 00:20:50.760 |
Again, and all of the telephony providers would also support this as well. 00:20:54.760 |
With SIP, you would call something, say, like Twilio. 00:20:58.760 |
You would call into, say, a server or a SIP provider like Daily, which offers SIP provided rooms. 00:21:06.760 |
And you then have the ability to kind of bring the two together via that SIP connection. 00:21:10.760 |
The nice thing about SIP is that you have the ability to have superior call control. 00:21:15.760 |
It is slightly more complicated, whereas that WebSocket connection is instantaneous. 00:21:22.760 |
So there's a whole, we're not going to talk about it today, but like cold starts for agents. 00:21:27.760 |
So if you don't have resources provisioned, you don't want your users waiting like 20 seconds for the bot to join. 00:21:33.760 |
So a long-ish, medium-ish answer for a very complicated question. 00:21:50.760 |
Well, I think you're asking how PipeCat compares to Cartesia. 00:22:09.760 |
Yeah, Cartesia is, well, I think they're going to do more stuff. 00:22:14.760 |
But as of today, it's a text-to-speech service. 00:22:18.760 |
And you can clone your voice, or you can, you know, and then they provide a real-time API. 00:22:24.760 |
You can just via WebSockets, you pass the text, and they reply with audio, basically, with audio frames. 00:22:31.760 |
And then PipeCat integrates with Cartesia like any other text-to-speech service, like ElevenLabs or anything. 00:22:38.760 |
Think about PipeCat as just a framework for developers where they can plug and play the services they want. 00:22:46.760 |
So they can take Cartesia, put Cartesia in, take it out, put in ElevenLabs. 00:22:56.760 |
Yeah, you can plug and play any service you want. 00:23:00.760 |
Like, you can change the LLM: you can use Llama, you can use Anthropic; for text-to-speech you can use Cartesia or ElevenLabs. 00:23:08.760 |
Then for speech-to-text you can use Deepgram, or you can use one box, like Gemini Live, which has all that built for you. 00:23:28.760 |
[Audience question, partly inaudible, about how to make sure the model says something correct, and what happens when the user says something that might not be right.] 00:24:17.760 |
Are you asking, like, how to ensure that the LLM says the right thing, or doesn't say the wrong thing? 00:24:28.760 |
Yeah, well, PipeCat doesn't have control over that. 00:24:31.760 |
It's up to you to define the prompt or define how the LLM will reply. 00:24:40.760 |
If you want, you can write your own processors, as we call them, those little boxes, and check what the LLM has said before it goes out; for example, put some kind of real-time eval in place to make sure the LLM got it right. You could do that, yeah. 00:25:01.760 |
All that, that pipeline is very flexible, so you can put whatever you want there, like, in parallel, not in parallel. 00:25:10.760 |
For example, Mark was saying about parallel pipelines, like, if you have video and audio at the same time, with Gemini Live you can do everything in one box, but let's say you don't have Gemini Live, you want to use other services, that one does video and the other one does audio. 00:25:28.760 |
So, you know, you can have a parallel pipeline, which, you know, it's like a tree, right? 00:25:32.760 |
You have your transport input, and then if it's audio, it goes this way; if it's video, it goes that way, and you can do things dynamically like that, yeah. 00:25:46.760 |
Is it common that people put some kind of check in place, and what is the actual latency of what that produces? And then for something like Gemini Live that produces audio, would it also produce the text with it, or do you have to kind of do speech-to-text again and then go back to it? 00:26:12.760 |
So the question was, are guardrails required, and how do people use them, and how does that apply to speech-to-speech models? 00:26:24.760 |
So the answer is, they're not required, and there is a challenge here, and actually, I talked about this on, like, one of the first slides. 00:26:32.760 |
Latency is absolutely critical, so what you want to avoid are unnecessary turns. 00:26:36.760 |
You know, obviously, LLMs are amazing language processors, so if you had all the time in the world, you could do hallucination checking against the LLM. 00:26:47.760 |
One of the big things is that we see, because of the aggregate nature of the context, it grows over the course of the conversation. 00:26:54.760 |
Actually, you can find better accuracy with the responses if you have more control over how you prompt the LLM. 00:27:06.760 |
What we found is there are two ways to handle this. 00:27:11.760 |
One, if you, for a lot of conversations, they're going to be task-oriented. 00:27:15.760 |
So let's say something simple like a restaurant reservation bot. 00:27:19.760 |
It may have to take your name, get your time, log the time to a database. 00:27:23.760 |
You can chunk that out, even that small conversation, into just discrete tasks, and LLMs are really good at following the most recent input. 00:27:32.760 |
So if you kind of feed it task by task, that helps. 00:27:36.760 |
Also, if you control the context window, like the size, it can really be beneficial to kind of manage that really judiciously. 00:27:45.760 |
One example might be, let's say you're building a patient intake bot at a doctor's office. 00:27:50.760 |
The very first thing it may do is verify the date of birth, which serves no utility beyond just the very first, you know, checkpoint. 00:27:57.760 |
So you may actually remove that from the context, like completely get it out of there. 00:28:01.760 |
Because otherwise, it's just cruft that hangs on. 00:28:04.760 |
And instead, you kind of reset and then maybe roll through the tasks. 00:28:07.760 |
You could also, for really, really long conversations, summarize the context. 00:28:11.760 |
So you may want to do an out-of-band LLM call. 00:28:14.760 |
And this is something actually, Quinn, just that we talked internally about this, that we're going to see more and more of this mixture of LLMs where, even in the context of real time, you may have an out-of-band, like, REST call to the text-based LLM just to do a summary and then return it back so that you can kind of compress that context window. 00:28:32.760 |
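A sketch of what such an out-of-band summary call can look like, here using the google-genai SDK with an assumed model name; the message format and the "keep the last six turns" cutoff are illustrative choices, not anything prescribed by PipeCat:

```python
from google import genai

client = genai.Client(api_key="GOOGLE_API_KEY")

def compress_context(messages: list[dict]) -> list[dict]:
    """Replace older turns with a one-paragraph summary from a text LLM."""
    if len(messages) <= 6:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-6])
    summary = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model name
        contents=f"Summarize this conversation in one short paragraph:\n{transcript}",
    ).text
    # Keep a compact summary plus only the most recent turns.
    return [{"role": "user", "content": f"Conversation so far: {summary}"}] + messages[-6:]
```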
And just to give a call out to Google, the live API, so maybe transitioning there, they offer context management through a bunch of different strategies. 00:28:41.760 |
Like a rolling, they have a rolling window or sliding window. 00:28:44.760 |
I think they offer, like, token caps for that so you can have some control. 00:28:47.760 |
Or, if you want, you can output text and then kind of do whatever you want with it. 00:28:51.760 |
They also take text input, so there's a lot of flexibility with speech-to-speech models. 00:28:56.760 |
They do offer, or they do pose some other maybe development challenges, but offer, like, tremendous benefits in terms of the features they offer. 00:29:06.760 |
I'm going to hold questions just for a sec, because I've heard the Wi-Fi is back. 00:29:19.760 |
Workshop, voice, Gemini, pipecat, with dashes in between, on the AI Engineer Slack. 00:29:31.760 |
workshop-voice-gemini-pipecat is the channel name. 00:29:35.760 |
So, if you go to the AI engineer Slack and search for workshop Gemini, it should come up. 00:29:40.760 |
And Quinn will be posting links as we go along if the Wi-Fi stays on. 00:29:46.760 |
If anyone can get on that channel, can you raise your hand? 00:29:50.760 |
Can people, like, do they have, do you guys have access to the Slack, AI engineer Slack? 00:30:08.760 |
So, we'll see; that happened when the Wi-Fi was down and we were trying to create it. 00:30:15.760 |
On your badge, there should be the link to join the Slack group. 00:30:24.760 |
I'd like to walk through the quick start and then we can look at some examples. 00:30:32.760 |
I mean, there's obviously massive latency benefits because you cut out network round trips all over 00:30:38.760 |
And depending on where you are in the world, that can save a lot. 00:30:39.760 |
If you're US-based, you know, your network latency is going to be relatively low. 00:30:43.760 |
But a lot of the developers in the community are in Europe and a lot of these AI services 00:30:48.760 |
are relatively new with, you know, data centers only in the US. 00:30:52.760 |
So, there are different challenges when doing that, though, when running it. 00:31:08.760 |
Actually, there are great hosting providers like Modal that offer really good options for 00:31:12.760 |
running, like, your own local LLM, where you're, you know, buying or I guess 00:31:17.760 |
leasing the GPU time instead of running everything on your own GPU, which would probably be 00:31:22.760 |
cost prohibitive because of the way processes run. 00:31:47.760 |
Well, the large context windows definitely cause LLMs to... 00:31:59.760 |
The question was around state management with LLMs and whether it's better to, I guess, chunk 00:32:08.760 |
or have more kind of deterministic input versus just a large context where you just dump everything 00:32:14.760 |
This is actually an extension of what the other gentleman was asking. 00:32:20.760 |
And actually, so, daily we built the chat widget that's on the homepage of the voice AI 00:32:33.760 |
And Alesh and I were just talking about this. 00:32:35.760 |
Function calls in the context of real time are still slow, unfortunately. 00:32:42.760 |
I'll give more props to Gemini being like maybe one of the fastest. 00:32:45.760 |
If you run one of these demos locally and you ask it what the weather is, it will return back with a time to first byte of, I don't know, less than 500 milliseconds. 00:32:54.760 |
Whereas other vendors, not trying to throw OpenAI under the bus, but it has gotten slower. 00:33:00.760 |
You might see upwards of 1.5 to 2 seconds of waiting just to get that first token back. 00:33:06.760 |
Actually, this is just something recently this morning. 00:33:08.760 |
The issue is when you get the normal streamed response for the conversation, you can start playing that audio out once you get the first sentence. 00:33:14.760 |
The issue being when you get a tool, you need the entire JSON response before you can actually do anything with it. 00:33:23.760 |
Separately, chunking the prompts is absolutely the way to go. 00:33:26.760 |
And building that World's Fair bot, I was kind of balancing between the two worlds, because you can talk to it and ask it about speakers for the session. 00:33:36.760 |
Route 1 would have said, let's use like a RAG approach and put all of the speaker JSON in something that could be a tool that could be accessed. 00:33:45.760 |
And I tried that and unfortunately, it's just a little too slow. 00:33:50.760 |
What's interesting is if you instead move that all just directly into the context with Gemini Live. 00:33:56.760 |
It's a little bit variable, but under good conditions, you'll get a response back on that like 800 millisecond latency. 00:34:03.760 |
So it actually has access to like the full context. 00:34:05.760 |
The one trade-off though is what this gentleman over here was asking is accuracy. 00:34:10.760 |
It's going to get confused, especially when you give it a JSON with a lot of speakers. 00:34:16.760 |
With that much data, even structured data becomes very hard to discern what's what when a lot of it looks the same. 00:34:22.760 |
So it's all, I mean, a lot of this is emerging like other things in AI, trying to just do it as fast as possible with voice. 00:34:29.760 |
Before I take one more question, I want to check in on whether folks have had any luck with the repo. Got thumbs up. 00:34:36.760 |
All right, I'm going to pause on questions for the time being. 00:34:40.760 |
I do want to try to go through the quick start if we could. 00:34:50.760 |
It's workshop, hyphen, voice, hyphen, Gemini, hyphen, pipecat. 00:34:55.760 |
Quinn is in there to answer questions; we're going to join and we're going to share links there. 00:35:01.760 |
So I would recommend maybe we'll just take like a few minutes of independent getting set up. 00:35:08.760 |
If you all go to the README in the gemini-pipecat-workshop repo. 00:35:13.760 |
And if you'd roll through the first few steps, maybe get a, I don't know, a hand up. 00:35:19.760 |
You could just flash it up real quick so I could see when people start to get through it. 00:35:29.760 |
So for this workshop, I have leaked my key, a key from one of my accounts, which I'll cycle afterwards. 00:35:40.760 |
So you don't need to sign up for a Daily account. 00:35:47.760 |
So in the environment.example, you'll see there's already a daily key, which you can use. 00:36:16.760 |
So that's a good question in terms of the PipeCat interface. 00:36:35.760 |
There's a service class within PipeCat. 00:36:37.760 |
So how you write your application code is going to be uniform across all of the services. 00:36:42.760 |
The individual providers, unlike the text-based LLMs, haven't really settled on a common format. 00:36:50.760 |
So PipeCat handles all that translation on your behalf. 00:36:53.760 |
And I think there are other frameworks that do similar things. 00:36:55.760 |
The idea is to provide like a uniform, simple interface. 00:36:59.760 |
So that you could take, and this is part of the modularity. 00:37:02.760 |
If you wanted to, you could swap this bot out for OpenAI Realtime or, you know, a text-based LLM. 00:37:13.760 |
There is maybe one thing, in terms of LLM providers, I don't know. 00:37:21.760 |
The system instruction or system prompt is not uniformly dealt with. 00:37:27.760 |
OpenAI has this message role that is system, which you can inject anywhere at any time, 00:37:34.760 |
but Anthropic and Google require a named system instruction that's a special, one-time thing. 00:37:51.760 |
Any -- how are folks doing on getting Quick Start going? 00:37:55.760 |
Is that a question or -- okay, I'll take -- maybe it's related to the Quick Start. 00:38:08.760 |
Yeah, so there's a question about noisy environments, which is like voice AI's kryptonite. 00:38:23.760 |
But there are -- again, this is where like PipeCat is the assembler of all things. 00:38:28.760 |
You want to plug in noise cancellation, which you can run separately from the VAD. 00:38:34.760 |
We found Krisp is one; they're a partner of ours. 00:38:40.760 |
You can run something outside of the loop that would actually clean up. 00:38:44.760 |
like in the -- in PipeCat, it would be in the transport itself. 00:38:47.760 |
So in that audio input, it would take the audio input and remove any ambient noise. 00:38:54.760 |
But maybe more impressively, human background voice. 00:38:59.760 |
So you could be in this conference and it picks up the primary speaker for the device like incredibly well. 00:39:04.760 |
At the moment, they're the only ones that I'm aware that know how to do that and do it that well. 00:39:11.760 |
The VAD is -- I mean, it was one of the biggest problems that we saw until we found Krisp, and they're fantastic. 00:39:31.760 |
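A sketch of how that plugs in, assuming the KrispFilter helper and the audio_in_filter transport parameter described in the PipeCat docs; it needs the Krisp SDK and credentials, and these names may differ by PipeCat version:

```python
from pipecat.audio.filters.krisp_filter import KrispFilter
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    audio_in_filter=KrispFilter(),     # clean the input audio before VAD/STT see it
    vad_analyzer=SileroVADAnalyzer(),
)
```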
So I think the options are Gemini Multimodal Live, OpenAI Realtime, and then AWS just launched a new one called Nova Sonic. 00:39:43.760 |
They're all pretty -- they're all very much -- actually, well, they're all very similar. 00:39:55.760 |
I'm not going to -- I don't want to, like, nitpick on all of the vendors here. 00:39:59.760 |
But they're -- I mean, they're -- they have strengths and weaknesses each because it's still an emerging field. 00:40:03.760 |
But they're -- latency-wise, that is not an issue. 00:40:19.760 |
Why don't we -- he's going to just do some live coding here, and maybe this will help to understand how this all works. 00:40:24.760 |
And then perhaps we -- I stop chatting, and maybe you could grab me if you have questions and just do some heads-down working time. 00:40:40.760 |
I'll try from the very, very -- from nothing, from scratch. 00:40:52.760 |
The first thing we'll do is create an environment, like a virtual environment that it's called. 00:41:00.760 |
That's how you create a virtual environment in Python. 00:46:11.760 |
And we also need a VAD analyzer, which is gonna be a Silero VAD analyzer. 00:46:19.760 |
So the transport is gonna be able to use this VAD analyzer to detect if the user has spoken or not. 00:46:41.760 |
In this case, it's Gemini Live, so I don't need to create a speech-to-text or text-to-speech. 00:46:52.760 |
And for that, I'll need to copy it, 'cause I don't know that from memory. 00:47:15.760 |
And, again, it uses this Gemini Multimodal Live LLM service. 00:47:29.760 |
The system instruction, which is like what the agent is gonna do, and some tools. 00:47:52.760 |
So the system instruction is you're a helpful assistant who can answer questions and use tools. 00:48:26.760 |
I'm gonna avoid storing the context because I don't think we need it for now. 00:48:35.760 |
The pipeline just receives a list of processors or elements. 00:48:46.760 |
So how we get audio from the daily room in this case. 00:49:04.760 |
So you could build a pipeline of pipelines of pipelines of pipelines of pipelines. 00:49:09.760 |
Or you can plug and play the way you like that. 00:49:46.760 |
You can create more than one pipeline task if you wanted. 00:50:29.760 |
And I need to load the environment variables. 00:50:44.760 |
This is just a function that imports the environment variable. 00:51:28.760 |
So at the beginning I wrote that file requirements.txt. 00:51:33.760 |
Which has, well, just a few requirements in it. 00:51:49.760 |
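Roughly, that requirements file amounts to PipeCat with the Daily, Google, and Silero extras plus dotenv; the exact extras names are an assumption and vary by PipeCat version:

```
# requirements.txt (sketch)
python-dotenv
pipecat-ai[daily,google,silero]
```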
In the meantime, I'm just gonna go to the daily room that I just pointed the bot to. 00:52:10.760 |
So that's, right now it's just me in that room. 00:52:16.760 |
So, and now we just have to wait for this to, to finish. 00:52:21.760 |
And hopefully the bot will join the room and we'll be able to talk to it. 00:52:32.760 |
[Audience question, partly inaudible, about why Daily is part of the setup.] 00:52:40.760 |
Yeah, this is because we're using the Daily transport. 00:52:45.760 |
And the Daily transport just connects to a Daily room. 00:52:48.760 |
But you could have a WebSocket transport. 00:53:11.760 |
So in PipeCat, based off of the aiortc Python package, 00:53:20.760 |
we've also added a new transport called the small WebRTC transport. 00:53:24.760 |
It is a peer to peer WebRTC communication that's free. 00:53:29.760 |
Though the one downside is that it requires a TURN server, which you bring your own. 00:53:35.760 |
So we, we didn't, you know, we weren't prepared for that for the conference. 00:53:38.760 |
And also just the conference wifi makes that a little challenging. 00:53:40.760 |
But normally, if you're running any of the, we call them foundational examples, 00:53:46.760 |
think of them as the essential examples that show how to do very specific functions. 00:53:52.760 |
There's probably about a hundred of them in PipeCat, and one by one, they show you how 00:53:55.760 |
to, like, record or add an STT or push frames or show images or sync images and sound. 00:54:02.760 |
Those all use the peer to peer WebRTC transport. 00:54:05.760 |
So we, we would have loved to have used that. 00:54:07.760 |
You wouldn't need a key, but unfortunately firewall rules have trumped. 00:54:14.760 |
So I'm just running the bot; let's see how it fails, because it has to fail the first time, 00:54:28.760 |
'cause there are a bunch of Python packages and Python just decides to take its time to load them. 00:54:37.760 |
But you see how easy it was to write an agent, like a voice agent, with Gemini. 00:54:47.760 |
If it worked, it'd be just a few lines of code that we wrote in, I don't know how 00:54:53.760 |
long it took me, but maybe like five, ten minutes. 00:54:58.760 |
Are there any questions on the example or what's that? 00:55:15.760 |
There's a couple of things that popped up there with the words later in it. 00:55:29.760 |
We have, I don't know the actual number of customers, but I mean, PipeCat probably serves hundreds. 00:55:43.760 |
I mean, PipeCat is used by some very large companies in production, and people are 00:55:48.760 |
contributing to it from NVIDIA, AWS, OpenAI, Google, lots of big companies. 00:55:55.760 |
There's one thing we didn't mention about PipeCat, which is that what you see now on the screen 00:56:01.760 |
runs on the server side, but we do have client SDKs, as is the case for Android, iOS, JavaScript, 00:56:08.760 |
and React, and I think that's about it, but even a C++ client, if you want. 00:56:21.760 |
So that's the server side, but you can plug in your client and connect to the agent. 00:56:29.760 |
That would depend on the transport you use, but yeah, you could 00:56:36.760 |
have your client connect to a Daily room -- we support LiveKit as well -- but to a Daily 00:56:42.760 |
room; we like Daily because we work at Daily. But you can connect 00:56:47.760 |
to a Daily room and then the bot, or the agent, would connect to the Daily room as 00:56:53.760 |
well. And then that's the transport, the WebRTC transport. 00:57:18.760 |
Yes, there is, actually. For the previous version, I just hacked 00:57:28.760 |
together a thing that I call release evals, which is a bot talking to a bot. 00:57:33.760 |
And what it does is I put this bot up and then it joins a Daily room, and then I have 00:57:40.760 |
an eval bot, and what the eval bot is going to do, it has a prompt, which is: ask a 00:57:49.760 |
question, like what is two plus two. And then that eval bot is going to connect to the room. 00:57:54.760 |
And then the other bot is going to reply: two plus two is four. 00:57:58.760 |
And the eval bot, the LLM, checks if the answer of the user is correct. 00:58:11.760 |
The good thing is we used to run like more than a hundred examples every release. 00:58:19.760 |
So I just got tired of it, 'cause it's very painful and very slow. 00:58:23.760 |
So we have these eval bots, or release evals, that are gonna test each service. 00:58:31.760 |
Like we test Gemini Live, we test Cartesia, Deepgram, like all the services, end to end. 00:58:38.760 |
And then the bots basically talk to each other with voice. 00:58:47.760 |
Maybe real quick just to show for those that didn't see it, it is. 00:58:58.760 |
Which, if you build PipeCat, I've heard that joke probably, I mean, not like 5,000 times. 00:59:06.760 |
Like you can even try to seed it with something different, but it's still- 00:59:18.760 |
Why do you think people keep telling the same jokes? 00:59:27.760 |
From my search, there appear to be several reasons why people repeat jokes. 00:59:33.760 |
People repeat things they find funny because they want to re-experience the good feeling associated 00:59:39.760 |
And repeating it is a way to try and recreate that sensation. 00:59:47.760 |
Is there anything else I can help you with today? 00:59:54.760 |
Well, just to show there was a question about interruptions. 00:59:55.760 |
We could just have it, like, my favorite is to ask it to tell you like a really long story 01:00:08.760 |
And feel free to interrupt whenever you like. 01:00:21.760 |
[Audience question, partly inaudible, about how the bot decides it's its turn to speak based on when you pause.] 01:00:40.760 |
So I think the questions asked during this workshop could map out, like, years of work. 01:00:57.760 |
So this is, like, another one of those fantastic cutting edge things. 01:01:01.760 |
So, again, back to, like, human evolution: we all know, when we talk, it's 01:01:06.760 |
even hard for humans not to speak over each other. 01:01:09.760 |
So the way that it works mechanically is when the user stops speaking, the VAD has a timeout. 01:01:14.760 |
You tell it and program it, wait, let's say, one second, 0.8 seconds, half second, whatever 01:01:20.760 |
And you're trying to balance low latency response with giving the user enough time to speak. 01:01:26.760 |
And it's one of the biggest complaints is that agents will speak over the human. 01:01:30.760 |
So let's say you're building an interview bot, like, you're using, like, Tavus, one of their 01:01:35.760 |
digital twins, you want to have, like, a real, like, likeness, and you want to speak to it. 01:01:40.760 |
You may take time to think, because sometimes you have to take time to think. 01:01:44.760 |
And that's a really difficult thing for bots to do, because, again, it's driven by, like, a simple stop-speaking timeout. 01:01:50.760 |
So this is a new, like I said, it's like an emerging field of models, which is looking at semantic end of turn. 01:01:58.760 |
So driven off of things like speech filler words, pauses, intonation, so things in the audio realm, and also things in the text-based realm, so just looking at context. 01:02:11.760 |
So we've actually started, we're one of many that are doing this, I think, but we launched a model. 01:02:17.760 |
If you look at it on GitHub, it's under smart-turn. 01:02:20.760 |
It's a native audio in classifier that runs an inference on the input audio, and it simply outputs either complete or incomplete. 01:02:30.760 |
And the way PipeCat uses this is that if you get an incomplete response, we can dynamically adjust the VAD timeout. 01:02:36.760 |
So we can tell the PipeCat bot, okay, he or she is not done speaking. 01:02:41.760 |
Let's actually move the, let's give three seconds to complete the thought. 01:02:45.760 |
And if it's not done, then the bot will actually respond. 01:02:47.760 |
So you can create a little bit of, like, dynamic interaction there. 01:02:52.760 |
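As a purely hypothetical illustration (not PipeCat's actual API), the dynamic-timeout idea boils down to something like this:

```python
DEFAULT_STOP_SECS = 0.8      # normal end-of-turn silence
INCOMPLETE_STOP_SECS = 3.0   # extra time when the classifier says the thought isn't finished

def next_vad_stop_secs(turn_state: str) -> float:
    """turn_state is the smart-turn classifier output: 'complete' or 'incomplete'."""
    return DEFAULT_STOP_SECS if turn_state == "complete" else INCOMPLETE_STOP_SECS
```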
I'm sure the Google team is working on similar things. 01:02:55.760 |
I know OpenAI is, and all the STT vendors are also looking at their own things. 01:03:00.760 |
So I'd say, right now, it is very much an unsolved problem. 01:03:03.760 |
But I would imagine, given how fast things are going in the next 12 months, we'll have great solutions that will make it even more natural to talk to a bot. 01:03:21.760 |
Well, actually, I was kind of wondering, like, is there a way to see the transcripts of that happening? 01:03:34.760 |
One, this is actually back to the, well, for, this is specific to PipeCat, but also, like, Gemini Live will output audio and text. 01:03:45.760 |
PipeCat offers, in terms of its, again, orchestration role, when you get a, actually, it's going to be specific to TTS provider. 01:03:55.760 |
Many, there are great TTS providers that do word and timestamp synchronization. 01:04:02.760 |
So if you're using a Cartesia or an ElevenLabs or Rime, they all output these pairs. 01:04:08.760 |
One of the really cool things with PipeCat is that the TTS services output not only the audio stream, but also the text stream. 01:04:15.760 |
So they'll output text frames, TTS text frames, we call them in PipeCat. 01:04:19.760 |
And if you place, we have, in terms of how the client software works, there is, like, an observer role, where you can actually watch. 01:04:27.760 |
There's a process that can watch things that happen in the pipeline and emit events. 01:04:30.760 |
So we've instrumented that for the clients so that whenever you see those text frames move through the transport, you can get synchronized word and audio output. 01:04:39.760 |
So in your client, if you wanted to have word-by-word output synchronized to the audio, you can do that with PipeCat. 01:04:47.760 |
I think you listen to, like, bot TTS text output or on bot TTS text, and it will give you the synchronized output. 01:04:53.760 |
If I wanted to build a fully offline voice engine, what box in that pipeline would be hardest to do? 01:05:05.760 |
Fully offline? Well, they're all doable. They're great models. 01:05:08.760 |
I think it really depends on what your bot needs to accomplish. 01:05:11.760 |
A lot of the state-of-the-art models, to do all the best and smartest things, need to have some. 01:05:16.760 |
Like, they're going to be run, like, on-prem or in the cloud. 01:05:20.760 |
But if you have -- and a lot of bots do jobs. 01:05:23.760 |
Like, if you wanted to build, like, a restaurant reservation one, like I referenced earlier, it's a very simple job. 01:05:28.760 |
You could probably run it with some version of Llama running locally. 01:05:32.760 |
There are great local -- something, again, Quinn has been experimenting with -- a lot of great local models. 01:05:39.760 |
Like, whisper has challenges, you know. I mean, it's -- it has a lot -- it has some challenges as an open source model for STT. 01:05:50.760 |
But there are good and emerging TTS services. So, you know, I -- things are only as good as the input. 01:05:56.760 |
And we've actually seen this with some of the speech-to-speech models that sometimes they mistranscribe. 01:06:00.760 |
So, you really need -- I mean, it's -- you know, every part is critical, but if you can't transcribe the speech really well, nothing really matters. 01:06:06.760 |
Like, it has to understand you. And having, like, disfluencies or, like, hallucinated responses or even just inaccurate responses kind of breaks everything down. 01:06:14.760 |
So, things mostly start at the STT. So, maybe that's the hardest. I don't know if there are a lot of good open source options for that right now. 01:06:20.760 |
I don't know. We're not doing anything in the STT world. No, no, that's a whole different ballgame. Good question, though. 01:06:31.760 |
Just made me realize we could have used local models and avoid this. 01:06:36.760 |
We could have. Yeah. Well, we're partnering with the Google team. 01:06:44.760 |
Has anyone looked at any of the sample projects and had questions? There's a lot of interesting things in there. 01:06:49.760 |
If any of this has, like, interested you, we do have a Discord. 01:06:53.760 |
You're welcome to get on it. You can find us at pipecat.ai and find our Discord there. 01:06:59.760 |
You can ask questions. There's some really cool stuff with Gemini that can be done. 01:07:04.760 |
There -- in particular, in the PipeCat repo, we built -- I don't know if you know the game Catch Phrase, 01:07:09.760 |
where you describe a word and something, you know, guesses it. We built a version of that. 01:07:14.760 |
We had to brand it something else called Word Wrangler. And you, as the human, you describe a word, 01:07:18.760 |
and then you have the AI agent try to answer it. So we built a client-server version of that, 01:07:23.760 |
which I linked to in the repo. And then we have one that's a phone-based one that's, I think, 01:07:27.760 |
particularly sophisticated and interesting. So you might think, like, how the hell would I build this with a speech-to-speech model? 01:07:33.760 |
We actually use two Gemini agents in the same call, and we use a parallel pipeline where one agent is the host giving out the questions to the human user. 01:07:44.760 |
The other is the guesser. And we kind of limit the audio flow so that the guesser, the AI player, can only hear the user. 01:07:51.760 |
So there's a bunch of really interesting things getting into, like, majorly into the weeds of some of the powers of Pipecat. 01:07:57.760 |
But it also speaks to the strength of having just native audio input being really, really helpful. 01:08:01.760 |
So I'd recommend checking those out. Really cool, easy demos to run. 01:08:06.760 |
One's Twilio. The other is, again, a client-server. I think it's, like, a React Next.js project. 01:08:15.760 |
What's that? Word Wrangler? Yeah, I mean, we could run the Word Wrangler client app. 01:08:27.760 |
Welcome to Word Wrangler. I'll try to guess the words you describe. 01:08:32.760 |
Remember, don't say any part of the word itself. Ready? Let's go. 01:08:37.760 |
I'm going to skip to something easier. Okay, this is something you take pictures with. It's on your phone. 01:08:48.760 |
All right, this is a field related to the study of languages, I think. 01:08:56.760 |
All right, this is a game of the yellow ball you play with rackets. Hit the ball over the net. 01:09:04.760 |
All right, this is a round dessert with chocolate chips sometimes and other fun goodies. 01:09:13.760 |
It's really good even when I'm bad at giving answers. 01:09:16.760 |
So, pretty cool. This is built with Gemini Live. 01:09:19.760 |
But, again, just an example of things you can build with voice AI. 01:09:27.760 |
All right. I think that's about it. Thanks, everybody.