Full Workshop: Realtime Voice AI — Mark Backman, Daily



00:00:00.000 | I'm Mark with Daily. This is Alesh. And then we have a few other Daily folks:
00:00:20.240 | Quinn, Nina, Varun, and then, I'm not sure where he went, but Philip right there,
00:00:27.760 | from the Google DeepMind team. So this session we're going to spend just a
00:00:32.320 | few minutes getting everyone started. The idea here is going to be a hands-on
00:00:35.600 | workshop where all the folks I just called out are going to be available to
00:00:39.820 | help out. We'll walk you through a quick start to get you up and running and then
00:00:44.140 | the idea is to build something. So build a voice bot in the next 78 minutes and 12
00:00:49.300 | seconds or whatever time we have left. The one, I guess there's one consideration
00:00:55.520 | is the Wi-Fi. If you don't have good Wi-Fi you might want to try to tether. I was
00:01:00.240 | able to tether and it worked fairly well but the conference Wi-Fi was a little
00:01:03.280 | shaky. This is real-time so you will be streaming data. It does require a viable
00:01:08.900 | connection and not just sending a few bits over. So just a heads up if you hit
00:01:13.040 | that as a snag. So I guess before I get started, who here knows about PipeCat or
00:01:19.760 | has built anything with voice AI? Okay. A smaller audience. Has anyone built any real-time
00:01:27.800 | applications with LLMs or AI? Maybe slightly bigger? Okay. Great. So PipeCat is an open-source repo: a Python framework for building
00:01:31.040 | voice and multimodal AI agents. It's built by the team at Daily, but it's an open-source project that anyone can contribute to.
00:01:49.760 | It's been around for, I don't know, just over a year now? Yeah, like I would say officially PipeCat was March 2024, something like that. Okay. So 13 months, there we go. So just a quick walk through, maybe just to kind of ground everyone in the thinking around voice AI. These slides weren't built for this talk, but I'm going to use them.
00:02:13.760 | So, you know, voice AI and real-time applications are tough because we as humans communicate with each other all the time; there are tens of thousands of years of evolution baked into our brains. So it's pretty tough to make a machine work on the same level, and as users we have great expectations. You need a good listener, something smart and conversational. You need to be connected to data stores.
00:02:38.760 | It has to sound natural. Think back to even just two or three years ago, what voice bots sounded like; many of them, if you call them on the phone, still sound that way. It needs to sound natural. And actually, kudos to the Google team: the latest Gemini Live native audio dialogue is quite good in that regard. It also has to be fast. The whole end-to-end exchange needs to happen quickly, and roughly, the benchmark is around 800 milliseconds.
00:03:07.760 | You could strive for better. I think we see, on the human level, it might be 500 milliseconds or somewhere on that order. So it is pretty fast. So there's a lot to get all the way there. And this is something that we at Daily, everyone building PipeCat, have been working very, very hard on: getting all the way to meeting all these expectations.
00:03:28.760 | So just to kind of ground you in some of this, since we're going to be working in PipeCat. PipeCat has a pipeline.
00:03:33.760 | I don't know, maybe Alesh, you want to talk a little bit about the origin of that quickly? Sure, sure. You can think about it as a multimedia pipeline. And what is a multimedia pipeline? Basically, just think about boxes that receive input, and input could be audio or video. And then those boxes stream that same data, or modified data, or new data,
00:04:00.760 | to the following elements or processors in PipeCat. Well, we call them processors. So in PipeCat, you would have a pipeline where you have a transport, which is the
00:04:12.760 | the transport of your data, the input of your data. For example, you could be talking in a meeting; that would be the audio of the user. Then you would have another box following that, which is the speech-to-text service. So the speech-to-text service would receive the audio of the user.
00:04:22.760 | It would transcribe it, then you would get text. That would be the following data that goes through the pipeline. And then the next one would be the LLM. So now the LLM has what the user has said.
00:04:34.760 | And then it generates output; for the LLM, that would be tokens. Then those tokens are converted by text-to-speech, and the text-to-speech outputs audio, and the audio goes back to the transport so you can hear what the LLM has said.
00:04:50.760 | What we're going to do today with Gemini Live, a lot of those boxes go away, because the LLM will do a bunch of these things. It will do the transcription, the LLM step, and the text-to-speech, all in one of these boxes.
00:05:11.760 | But you might still need some of them. For example, if you want to save the audio, record the audio into a file, you need a bunch of utilities to do that, and PipeCat has all of that built for you. So basically, that's it.
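To make the cascaded flow concrete, here is a minimal sketch assuming PipeCat's documented service classes; import paths, constructor arguments, and the env var names are illustrative and vary across PipeCat versions:

```python
# Minimal cascaded-pipeline sketch (class names per PipeCat's docs;
# import paths may differ across versions).
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),  # placeholder env vars
    None,                         # token (if the room requires one)
    "cascaded-bot",
    DailyParams(audio_in_enabled=True, audio_out_enabled=True),
)
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
tts = CartesiaTTSService(api_key=os.getenv("CARTESIA_API_KEY"),
                         voice_id="your-voice-id")  # placeholder voice

# Frames flow top to bottom, exactly as described:
# transport (audio in) -> STT -> LLM -> TTS -> transport (audio out).
pipeline = Pipeline([
    transport.input(),   # the user's audio from the Daily room
    stt,                 # speech-to-text: audio frames -> text frames
    llm,                 # LLM: transcribed text -> streamed response
    tts,                 # text-to-speech: response text -> audio frames
    transport.output(),  # the bot's audio back to the room
])
```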
00:05:32.760 | Yeah. I mean, a lot of this is really just orchestration. So if you think about what PipeCat offers, it's orchestration. It also offers abstractions for a lot of common utilities, like Alesh said.
00:05:44.760 | So recording, transcript outputs, artifacts you might want to produce, or even ways that you might manipulate the information in the pipeline itself. So this image here, which is what Alesh actually just talked through, is what you would call, I guess, a cascaded model, where you have this flow through of information.
00:06:01.760 | So you can build with Google and many different services in this way. In the last year, there's been an emergence of speech to speech models that now take audio in natively and audio out natively.
00:06:12.760 | And those models also allow for audio in, and then optionally, text and/or audio out. So you can actually, for example, take a raw, you know, microphone input, or, you know, audio input, and then the model would run all of its logic.
00:06:30.760 | So you can actually opt to have it output text, if you want to say, parse the text output before speaking. So there are a few different demos we'll look at that offer that.
00:06:38.760 | And in PipeCat, we show kind of all the ways to do things, because that's what PipeCat offers as a value proposition. So the--
00:06:47.760 | Yeah, just one thing. I don't think you'll mention it in the slides, but all these boxes, you can plug and play the service you want in PipeCat. So the speech to speech, speech to text, it could be, I don't know, Deepgram, for example; the LLM could be Google, or OpenAI, or whatever. You can just plug and play any service you want.
00:07:07.760 | Right, yeah. The modularity, I guess, is the other big strength. So there's no, you know, you can change out a service without changing out your underlying application code, which makes it easy.
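As a small illustration of that modularity (class names per PipeCat's docs; the alternative service is a hypothetical stand-in), swapping vendors is just a change where the service is constructed; the pipeline itself is untouched:

```python
import os

# Swapping a vendor is a one-line change at construction time; the rest of
# the application code and the pipeline stay the same.
from pipecat.services.deepgram import DeepgramSTTService

stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
# stt = SomeOtherSTTService(api_key=...)  # hypothetical drop-in replacement
```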
00:07:15.760 | And we see with this, a lot of companies that are building for voice AI might have maybe an even more complex thing. A pipeline here runs straight down, but you can actually have split branches, where one leg runs some logic and the other runs different logic. We call that a parallel pipeline.
00:07:31.760 | So if you wanted to have, say, a failover, if vendor A goes down, you can move to vendor B dynamically, even within the same conversation. That's something that pipecat affords as well. And that can allow you to transfer context over. So a lot of really cool stuff.
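A hedged sketch of the shape of a parallel pipeline: ParallelPipeline is a PipeCat class, but the branch contents below are placeholders reusing names from the earlier sketch, and the dynamic failover logic itself is not shown.

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline

# Frames fan out to every branch and merge back downstream. One branch
# could run your primary vendors while another runs logging or a fallback.
pipeline = Pipeline([
    transport.input(),
    ParallelPipeline(
        [stt, llm],        # branch A: primary services (from earlier sketch)
        [some_other_leg],  # branch B: e.g., analytics or a backup vendor
    ),
    tts,
    transport.output(),
])
```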
00:07:47.760 | The goal again today being to get you familiar with just building a voice agent and building one to get started. So one of the cool things, like Alesh had pointed out, that
00:08:00.760 | with the cascaded models, there's a lot of complexity, but with your speech to speech model, things get dramatically simplified.
00:08:07.760 | You know, your code may have looked something like this. This is an old example with, like, a ton of orchestration in the pipeline. But with a speech to speech model, you may be able to simplify it down to this. But then you have to remember, you actually need orchestration around it. So it does get simpler in some regard; it's more about the services you interface with.
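For contrast with the cascaded sketch above, here is roughly what the simplified speech-to-speech version looks like, using PipeCat's Gemini Multimodal Live wrapper and reusing the transport from before (constructor arguments are assumptions from the docs; "Puck" is one of the built-in voice names):

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

# One service replaces the STT -> LLM -> TTS trio.
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",
)

pipeline = Pipeline([
    transport.input(),   # raw microphone audio in (transport as before)
    llm,                 # transcription + reasoning + speech in one box
    transport.output(),  # model audio straight back out
])
```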
00:08:29.760 | I think with that, why don't we transition now, because I'm realizing we have only about 70 to 75 minutes left, to looking at the actual activity for today.
00:08:40.760 | All right, so there's a public repo. I don't know how big or small this is, but it's under daily-co. So on GitHub: daily-co/gemini-pipecat-workshop.
00:08:53.760 | I'll give everyone a chance to make sure the internet's working.
00:08:58.760 | Can everyone see the repo?
00:09:00.760 | that's really tiny.
00:09:01.760 | Yeah.
00:09:02.760 | Okay.
00:09:03.760 | I should have it in big text somewhere.
00:09:05.760 | No, no, not really.
00:09:08.760 | There we go.
00:09:13.760 | All right.
00:09:26.760 | Let's take a look at this.
00:09:31.760 | So what I want to do... Alesh and I spent a little bit of time writing up this repo.
00:09:36.760 | This is meant to be just a jumping off point.
00:09:38.760 | I'm going to get you oriented and then I want to look through one of the bot files, which is kind of the main pipecat code with you.
00:09:45.760 | And then we'll break and make this an interactive session where we can answer a bunch of questions.
00:09:50.760 | So in the repo, you could either start doing it now or maybe take a pause, but this will give you the steps to walk through getting the quick start running.
00:10:00.760 | Before we do that, I want to take a moment here.
00:10:05.760 | Let's see how the Wi-Fi is doing.
00:10:07.760 | It's not a good sign.
00:10:09.760 | It's down.
00:10:12.760 | Well, hey, you know, it is tough to do Wi-Fi for this many people.
00:10:18.760 | So very, very tough.
00:10:19.760 | Instead of real time, it's going to be real slow.
00:10:21.760 | Real slow.
00:10:22.760 | Yeah.
00:10:23.760 | Real slow voice communication.
00:10:25.760 | All right.
00:10:27.760 | Yeah.
00:10:28.760 | There will be some.
00:10:30.760 | Okay.
00:10:31.760 | So here is this GeminiBot.py file.
00:10:35.760 | This is all Python again.
00:10:37.760 | So everything will be in Python.
00:10:38.760 | There will be some client code options, which we'll look at in a second.
00:10:43.760 | Just to orient you, I'll just jump right into the meat of the pipeline.
00:10:47.760 | We have this main function that runs your bot.
00:10:52.760 | Everything is going to run kind of encapsulated within an aiohttp session.
00:10:56.760 | We're going to pass that session around.
00:10:58.760 | That's more of just kind of the mechanics of things.
00:11:00.760 | In our pipeline, let's just jump to the simple part here.
00:11:04.760 | We'll have Daily as the transport; Daily is a WebRTC provider, as well as the team building
00:11:10.760 | PipeCat.
00:11:11.760 | There will be context aggregation.
00:11:13.760 | So one important note, when you speak, every turn of the bot is like a discrete point in time.
00:11:21.760 | And this is maybe less so the case for a speech-to-speech model.
00:11:24.760 | But for basic LLMs, they get discrete inputs.
00:11:27.760 | Everything is like a REST API call.
00:11:29.760 | So you're going to get a snapshot of the conversation.
00:11:31.760 | The context aggregator is going to collect all the bits of the conversation, both from the user and the assistant.
00:11:39.760 | And we'll put that into a form the LLM can handle.
00:11:43.760 | So this, in this case, is more for function calling and kind of logistics and management.
00:11:50.760 | Gemini is amazing because it offers a lot of this for you.
00:11:53.760 | But if you were to build with, say, just Gemini, not Live, the actual kind of text-based LLM,
00:12:01.760 | you'd have to have this context aggregation.
00:12:03.760 | That will then go to your LLM, which is going to be Gemini Live: Gemini Multimodal Live.
00:12:08.760 | And then it's going to be outputted through Daily again, on that side of the transport.
00:12:15.760 | So you have daily.
00:12:16.760 | We'd configure our service, which takes a number of arguments.
00:12:19.760 | Like you set up a room with a token, give it a name, and then have some properties.
00:12:24.760 | There are docs, which I'll link to.
00:12:26.760 | It's linked in the quick start.
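Here is roughly what that transport setup looks like: a sketch, with parameter names per PipeCat's docs (exact fields vary by version) and placeholder env var names:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),    # the room to join
    os.getenv("DAILY_ROOM_TOKEN"),  # a meeting token for that room
    "Gemini workshop bot",          # the bot's display name
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # on-device VAD (discussed later)
    ),
)
```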
00:12:29.760 | There's also a Gemini Multimodal Live LLM service, which is a PipeCat class that is a wrapper around the Gemini Live API.
00:12:37.760 | So this, again, you just initialize and run.
00:12:42.760 | With the LLM, you see we do a few special things.
00:12:45.760 | We're going to define tools.
00:12:47.760 | This one has just two really basic kind of canned functions.
00:12:50.760 | Fortunately, we're not calling out to the internet because it's not working very well.
00:12:53.760 | Our connection is not working well.
00:12:56.760 | So this one just has dummies: a fetch-weather function and a restaurant recommendation.
00:13:00.760 | So these are two handlers that when your function is called, we'll just return this result information.
00:13:07.760 | So we have the actual functions themselves that are defined in this function schema, which is -- you can use just native function definitions using whatever LLM format.
00:13:19.760 | We also created this function schema, which is a universal schema that lets you define and move between any LLM without having to kind of transform your LLM calls from OpenAI to Anthropic to Gemini to Bedrock
00:13:33.760 | or Grok, because they're all a little bit different.
00:13:35.760 | You know, they all have slightly different formats.
00:13:37.760 | So this is more of kind of a universal transform for that.
00:13:40.760 | And then they're collected and translated into the native format in this tool schema.
00:13:47.760 | So we'll pass the tools then to the Gemini service, and that's how it gets access to use and run those tools.
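A sketch of what that looks like: a universal FunctionSchema, collected into a ToolsSchema, with a canned handler registered on the LLM service. The handler signature follows recent PipeCat docs (older versions pass the arguments individually), and the canned result is made up:

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema

# Vendor-neutral function definition; PipeCat translates it into the
# native format for Gemini, OpenAI, Anthropic, Bedrock, etc.
weather_function = FunctionSchema(
    name="get_weather",
    description="Get the current weather for a location.",
    properties={"location": {"type": "string", "description": "City name"}},
    required=["location"],
)
tools = ToolsSchema(standard_tools=[weather_function])

async def fetch_weather(params):
    # Canned result; no network call needed (handy on conference Wi-Fi).
    await params.result_callback({"conditions": "sunny", "temperature_f": 72})

# The handler runs whenever the model decides to call the function.
llm.register_function("get_weather", fetch_weather)
```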
00:13:54.760 | There's also a prompt above, which I think in this simple example, we just say, hey, you're a chat bot.
00:13:59.760 | You have these tools available, and that's that.
00:14:02.760 | We're also setting up our context aggregation, which, for better or worse, we use OpenAI as kind of the default, like the lingua franca for context.
00:14:13.760 | So everything gets kind of folded back into OpenAI at a certain level.
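Sketched out, that context setup looks something like this (OpenAILLMContext is the PipeCat class holding the OpenAI-format message list; the prompt text is a placeholder):

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

messages = [{"role": "system",
             "content": "You are a friendly chatbot. You have tools available."}]
context = OpenAILLMContext(messages, tools)

# The aggregator pair captures both sides of the conversation:
# .user() sits before the LLM in the pipeline, .assistant() after it.
context_aggregator = llm.create_context_aggregator(context)
```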
00:14:17.760 | And then we define the pipeline.
00:14:20.760 | So the pipeline is, again, like Alesh had said, just a list -- or I guess a tuple -- of all of your different services that are running in the pipeline.
00:14:31.760 | And you can write your own.
00:14:32.760 | So if you want to instead make the LLM output text, and you want to extract information from it, maybe it encodes some XML or some type of structured information,
00:14:42.760 | you can actually extract it, store it.
00:14:44.760 | Maybe your application does something with it.
00:14:47.760 | Or maybe you want to inject a text-to-speech type of frame.
00:14:52.760 | You can actually do that by switching the LLM's output from audio to just text.
00:14:59.760 | And then you would add, like, a text-to-speech service here.
00:15:02.760 | Or you could write your own processor, though maybe not within the next hour.
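For the curious, a custom processor is a small class. Here is a minimal sketch, per PipeCat's FrameProcessor API, that watches text frames flowing through the pipeline; what you do with the text is up to you:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TextInspector(FrameProcessor):
    """Watch LLM text on its way through the pipeline."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            # e.g., extract structured data, run a real-time eval, log it...
            print(f"LLM text: {frame.text}")
        # Always pass frames along, or the pipeline stalls.
        await self.push_frame(frame, direction)
```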
00:15:05.760 | And then lastly, all of these -- or not all, but many of them have events they emit.
00:15:10.760 | The transport emits handlers for when the client connects and disconnects.
00:15:15.760 | So in this case, we use this line here to actually inject a context frame into Gemini to kick off the conversation.
00:15:24.760 | So when your client application connects, it's going to queue a frame.
00:15:28.760 | So again, frames being kind of the base format for information.
00:15:33.760 | Think of it as like an object for your pipeline.
00:15:35.760 | You're going to queue one of those frames.
00:15:37.760 | And what this function does is it just grabs the latest context.
00:15:41.760 | So with what we set above -- I think it just says hello -- that's going to pass that into --
00:15:47.760 | basically push that into the pipeline, which then it will make its way to Gemini to initialize the conversation.
00:15:53.760 | And the rest of this, you could think of as boilerplate to run it.
00:15:56.760 | You create a runner, which then is the thing that actually runs your task.
00:16:00.760 | And the task is what runs the pipeline.
00:16:02.760 | So maybe beyond today, but just know that that's something that's required to run your code.
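Put together, the event handler and the run boilerplate look roughly like this. A sketch: the event name differs by transport and version (Daily transports have historically used "on_first_participant_joined"), and the final two lines live inside the bot's async main:

```python
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

# Event name is transport/version dependent; see the PipeCat docs.
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Queue the current context as a frame; it flows through the pipeline
    # to Gemini and kicks off the conversation.
    await task.queue_frames([context_aggregator.user().get_context_frame()])

# Inside the async main: the runner runs the task, the task runs the pipeline.
runner = PipelineRunner()
await runner.run(task)
```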
00:16:07.760 | Okay.
00:16:08.760 | I'm going to pause here just for any questions, because I do want to get to developing soon.
00:16:12.760 | So you said with the web app, can you use WebSockets, or do you provide WebRTC?
00:16:19.760 | Mm-hmm.
00:16:20.760 | We provide both.
00:16:21.760 | That whole topic is a talk in itself.
00:16:24.760 | The short answer is if you're building a client-server app, you should, with strong emphasis, use WebRTC.
00:16:31.760 | It has a whole bunch of properties that are relevant, like error correction, better audio quality, et cetera, et cetera.
00:16:38.760 | If you're building server-to-server, so one of the options today is to build a phone chatbot, you can use --
00:16:43.760 | and it's probably the best option to use WebSockets.
00:16:45.760 | So you can bring your own transport.
00:16:48.760 | We actually -- in PipeCat, there's a FastAPI version of that: a server that you can use to exchange messages over a WebSocket.
00:16:55.760 | So, yeah, it's really up to you.
00:16:57.760 | So I guess maybe the takeaway is if you're building client-server, you really want WebRTC, you could technically use WebSockets, but you'll hit like a long tail of errors when you get to production.
00:17:07.760 | And then server-to-server, totally fine.
00:17:09.760 | You're going to be fine with WebSockets.
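For the server-to-server case, here is a hedged sketch of PipeCat's FastAPI WebSocket transport. Class names are per the docs; the serializer depends on the telephony provider, and the FastAPI route wiring is omitted:

```python
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

async def run_bot(websocket):  # a fastapi.WebSocket your route has accepted
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            # serializer=TwilioFrameSerializer(stream_sid),  # provider-specific
        ),
    )
    # ...then build and run a pipeline around this transport as usual.
```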
00:17:14.760 | Could anyone download the repo by any chance?
00:17:18.760 | Did anyone not download the repo?
00:17:20.760 | Just one person?
00:17:21.760 | Two persons?
00:17:22.760 | Three, four?
00:17:23.760 | Wi-Fi is dead.
00:17:24.760 | Wi-Fi is dead.
00:17:25.760 | Can anyone -- can you tether?
00:17:26.760 | Is that an option?
00:17:27.760 | Do folks have tether hotspots on your phone?
00:17:28.760 | Or -- that's even shaky.
00:17:29.760 | Oh, geez.
00:17:30.760 | Okay.
00:17:31.760 | I can dance.
00:17:34.760 | I could do some -- I don't know.
00:17:37.760 | You know, I'm not really -- you know, it's not my thing, but I could try.
00:17:40.760 | True.
00:17:41.760 | Yeah, I guess I could just make this code walkthrough.
00:17:44.760 | Or we could write it.
00:17:47.760 | What -- yeah.
00:17:49.760 | Yeah.
00:17:50.760 | Does it have its own VAD, or--
00:17:52.760 | Gemini does have its own VAD. Yeah.
00:17:54.760 | So what's key to the choice?
00:17:56.760 | Right, Gemini does have its own VAD.
00:17:57.760 | In fact, if you're using a speech-to-text service, that likely also brings its own VAD to the equation.
00:18:03.760 | What we've found is that -- so maybe to use some of our extra time because we're having internet issues --
00:18:08.760 | the VAD serves a really important purpose of detecting when a user starts speaking.
00:18:13.760 | So in the whole kind of life cycle of a turn, that user speaking kind of ushers in the user's turn for the conversation.
00:18:20.760 | So PipeCat will emit a user started speaking frame, and that will also push through an interruption.
00:18:26.760 | So the user will interrupt, like, anything that's talking to -- if the bot was speaking or whatever.
00:18:31.760 | It basically clears the way because the user has expressed that they want to speak.
00:18:36.760 | The idea with the VAD is that we want it to be extremely accurate and extremely fast.
00:18:41.760 | So running something on-device -- we recommend Silero, which is an open-source option.
00:18:45.760 | It works incredibly well.
00:18:47.760 | I don't know exactly what the inference time is like -- milliseconds -- extremely fast.
00:18:52.760 | So you're going to get an event back fast.
00:18:55.760 | In fact, you have the ability to tune how long to hear human speech before that event gets emitted.
00:19:01.760 | And the defaults are pretty good in PipeCat, and there are maybe scenarios where you want to change that.
00:19:05.760 | But the VAD is a really important consideration.
00:19:08.760 | It's extremely low CPU consumption.
00:19:11.760 | Quinn has a great spreadsheet of breaking down the full cost analysis of an agent.
00:19:17.760 | And really, the CPU is going to be extremely low.
00:19:19.760 | Interestingly, the TTS tokens or characters are the most expensive by far.
00:19:24.760 | So when you think about it, running that local VAD gives you superior performance.
00:19:29.760 | And it allows you -- and there's not much of a cost to it.
00:19:32.760 | I mean, it's maybe like a fraction of 1% to run a local VAD.
00:19:35.760 | But, yeah, you have all sorts of choices.
00:19:38.760 | But we find that to work really well.
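For reference, tuning looks something like this. Parameter names follow PipeCat's VADParams; the values shown are illustrative, and as noted the defaults are usually fine:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.7,  # how confident the model must be that this is speech
        start_secs=0.2,  # speech needed before "user started speaking" fires
        stop_secs=0.8,   # silence needed before "user stopped speaking" fires
    )
)
# Hand it to the transport, e.g. DailyParams(vad_analyzer=vad, ...)
```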
00:19:40.760 | Is it integrated with phone carriers, too?
00:19:49.760 | So there are... I'll admit I haven't worked much with phones.
00:19:54.760 | Varun's been our phone expert, though.
00:19:56.760 | I don't -- Varun is like a figure in the WebRTC community.
00:19:59.760 | He's like an author of many things.
00:20:01.760 | But he's also a phone expert, which I don't know if that crosses over too much.
00:20:05.760 | Maybe that happened before.
00:20:07.760 | Phones are super complicated, how you actually make calls.
00:20:13.760 | PipeCat supports all of them.
00:20:15.760 | So maybe a very quick list.
00:20:17.760 | You can make a WebSocket connection with a phone provider like a Twilio
00:20:21.760 | or Telnyx or Plivo or Exotel and exchange media streams.
00:20:27.760 | So that's a way to have just a native WebSocket connection from PipeCat to Twilio.
00:20:32.760 | You'd call Twilio.
00:20:33.760 | It's going to emit the WebSocket and there'll be a handshake to get connected.
00:20:37.760 | You can also use PSTN, which is the public switched telephone network, which lets you dial in.
00:20:45.760 | And that's going to be kind of a different mechanism.
00:20:47.760 | There's also SIP, which is its own separate thing.
00:20:50.760 | Again, and all of the telephony providers would also support this as well.
00:20:54.760 | With SIP, you would call something, say, like Twilio.
00:20:58.760 | You would call into, say, a server or a SIP provider like Daily, which offers SIP-enabled rooms.
00:21:06.760 | And you then have the ability to kind of bring the two together via that SIP connection.
00:21:10.760 | The nice thing about SIP is that you have the ability to have superior call control.
00:21:15.760 | It is slightly more complicated, whereas that WebSocket connection is instantaneous.
00:21:20.760 | Your bot needs to be up and running.
00:21:22.760 | So there's a whole, we're not going to talk about it today, but like cold starts for agents.
00:21:26.760 | They need to start immediately.
00:21:27.760 | So if you don't have resources provisioned, you don't want your users waiting like 20 seconds
00:21:31.760 | while the bot comes online.
00:21:33.760 | So a long-ish, medium-ish answer for a very complicated question.
00:21:38.760 | Maybe you didn't know that.
00:21:40.760 | Yeah, yeah, yeah.
00:21:46.760 | I'll let you answer that.
00:21:47.760 | Yeah, yeah.
00:21:48.760 | Cartesia or--
00:21:49.760 | Yeah, can you repeat it? I didn't hear it.
00:21:50.760 | Well, I think you were asking how PipeCat compares to Cartesia.
00:22:08.760 | Is that--
00:22:10.760 | Yeah. Cartesia, well, I think they're going to do more stuff.
00:22:14.760 | But as of today, it's a text-to-speech service.
00:22:18.760 | And you can clone your voice, or you can, you know, and then they provide a real-time API.
00:22:24.760 | You can just via WebSockets, you pass the text, and they reply with audio, basically, with audio frames.
00:22:31.760 | And then PipeCat integrates with Cartesia like any other text-to-speech service, like ElevenLabs or anything.
00:22:38.760 | Think about PipeCat as just a framework for developers where they can plug and play the services they want.
00:22:46.760 | So they can put Cartesia in, take it out, put ElevenLabs in.
00:22:52.760 | Oops, we're even closer.
00:22:54.760 | Okay, now I can't hear myself.
00:22:56.760 | Yeah, you can plug and play any service you want.
00:23:00.760 | Like, you can change the LLM: you can use Llama, you can use Anthropic; you can use Cartesia or ElevenLabs.
00:23:08.760 | Then for speech-to-text you can use Deepgram, or you can use one box, like Gemini Live, which has all that built for you.
00:23:16.760 | So, is that clear? Yeah, yeah, okay.
00:23:24.760 | What about... [inaudible]
00:23:26.760 | About what, sorry?
00:23:28.760 | [Largely inaudible question] Like, how do you make sure the model says something that's correct for the platform, and handle the user saying something that shouldn't be answered, or whatever, right?
00:24:11.760 | Mm-hmm.
00:24:12.760 | I'm not sure if I understood.
00:24:17.760 | Are you asking, like, how to ensure that the LLM says the right thing, or doesn't say the wrong thing?
00:24:28.760 | Yeah, well, PipeCat doesn't have control over that.
00:24:31.760 | It's up to you to define the prompt or define how the LLM will reply.
00:24:40.760 | If you want, you can write your own processors, as we call them, those little boxes, and check what the LLM has said. For example, you could put some kind of real-time eval in there to make sure the LLM's output is right. You could do that, yeah.
00:25:01.760 | All that, that pipeline is very flexible, so you can put whatever you want there, like, in parallel, not in parallel.
00:25:10.760 | For example, Mark was saying about parallel pipelines, like, if you have video and audio at the same time, with Gemini Live you can do everything in one box, but let's say you don't have Gemini Live, you want to use other services, that one does video and the other one does audio.
00:25:28.760 | So, you know, you can have a parallel pipeline, which, you know, it's like a tree, right?
00:25:32.760 | You have your transport input, and then if it's audio, it goes this way; if it's video, it goes that way. And you can do things dynamically like that, yeah.
00:25:43.760 | This next question, yeah.
00:25:45.760 | Yeah, I have a related question.
00:25:46.760 | Is it common that people put some kind of check in place, and what is the actual latency that produces? And then for a model like Gemini Live that produces audio, would it also produce the text with it, or do you have to kind of do speech-to-text again on top of it?
00:26:09.760 | Okay.
00:26:10.760 | Sure.
00:26:11.760 | Okay.
00:26:12.760 | So the question was: are guardrails required, how do people use them, and how does that apply to speech-to-speech models?
00:26:24.760 | So the answer is, they're not required, and there is a challenge here, and actually, I talked about this on, like, one of the first slides.
00:26:32.760 | Latency is absolutely critical, so what you want to avoid are unnecessary turns.
00:26:36.760 | You know, obviously, LLMs are amazing language processors, so if you had all the time in the world, you could do hallucination checking against the LLM.
00:26:44.760 | There are other strategies to handle this.
00:26:47.760 | One of the big things is that we see, because of the aggregate nature of the context, it grows over the course of the conversation.
00:26:54.760 | Actually, you can find better accuracy with the responses if you have more control over how you prompt the LLM.
00:27:02.760 | So this is a whole topic and talk in itself.
00:27:06.760 | What we found is there are two ways to handle this.
00:27:10.760 | Well, at least two.
00:27:11.760 | One, if you, for a lot of conversations, they're going to be task-oriented.
00:27:15.760 | So let's say something simple like a restaurant reservation bot.
00:27:19.760 | It may have to take your name, get your time, log the time to a database.
00:27:23.760 | You can chunk that out, even that small conversation, into just discrete tasks, and LLMs are really good at following the most recent input.
00:27:32.760 | So if you kind of feed it task by task, that helps.
00:27:36.760 | Also, if you control the context window, like the size, it can really be beneficial to kind of manage that really judiciously.
00:27:43.760 | So you could either reset.
00:27:45.760 | One example might be, let's say you're building a patient intake bot at a doctor's office.
00:27:50.760 | The very first thing it may do is verify the date of birth, which serves no utility beyond just the very first, you know, checkpoint.
00:27:57.760 | So you may actually remove that from the context, like completely get it out of there.
00:28:01.760 | Because otherwise, it's just cruft that hangs on.
00:28:04.760 | And instead, you kind of reset and then maybe roll through the tasks.
00:28:07.760 | You could also, for really, really long conversations, summarize the context.
00:28:11.760 | So you may want to do an out-of-band LLM call.
00:28:14.760 | And this is something actually, Quinn, just that we talked internally about this, that we're going to see more and more of this mixture of LLMs where, even in the context of real time, you may have an out-of-band, like, REST call to the text-based LLM just to do a summary and then return it back so that you can kind of compress that context window.
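In plain Python, those two strategies might look like the sketch below. None of this is a PipeCat API: the "ephemeral" metadata tag and the text-LLM client are hypothetical, just to show the shape of resetting versus summarizing the context.

```python
def drop_ephemeral_turns(messages: list[dict]) -> list[dict]:
    """Remove turns tagged as one-time checkpoints (e.g., a DOB check)."""
    return [m for m in messages
            if not m.get("metadata", {}).get("ephemeral", False)]

async def summarize_context(messages: list[dict], text_llm) -> list[dict]:
    """Out-of-band call to a text LLM that compresses older history."""
    history = "\n".join(m["content"] for m in messages[1:-4])
    summary = await text_llm.complete(  # hypothetical text-LLM client
        "Summarize this conversation in one short paragraph:\n" + history
    )
    # Keep the system prompt, the summary, and the most recent turns.
    return ([messages[0],
             {"role": "user", "content": f"Conversation so far: {summary}"}]
            + messages[-4:])
```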
00:28:32.760 | And just to give a call out to Google, the live API, so maybe transitioning there, they offer context management through a bunch of different strategies.
00:28:41.760 | Like a rolling, they have a rolling window or sliding window.
00:28:44.760 | I think they offer, like, token caps for that so you can have some control.
00:28:47.760 | Or, if you want, you can output text and then kind of do whatever you want with it.
00:28:51.760 | They also take text input, so there's a lot of flexibility with speech-to-speech models.
00:28:56.760 | They do offer, or they do pose some other maybe development challenges, but offer, like, tremendous benefits in terms of the features they offer.
00:29:04.760 | I think there's a question for a long time.
00:29:06.760 | I'm going to hold questions just for a sec, because I've heard the Wi-Fi is back.
00:29:09.760 | Can folks try downloading the repo again?
00:29:12.760 | Slack channel.
00:29:13.760 | Slack channel.
00:29:15.760 | Quinn, can you maybe come to the mic?
00:29:17.760 | I can't remember what you told me.
00:29:19.760 | workshop-voice-gemini-pipecat, with dashes in between, on the AI Engineer Slack.
00:29:26.760 | Okay.
00:29:27.760 | Does everyone know where that is?
00:29:29.760 | I know this is, like, day one, hour three.
00:29:31.760 | workshop-voice-gemini-pipecat is the channel name.
00:29:35.760 | So, if you go to the AI engineer Slack and search for workshop Gemini, it should come up.
00:29:40.760 | And Quinn will be posting links as we go along if the Wi-Fi stays on.
00:29:46.760 | If anyone can get on that channel, can you raise your hand?
00:29:50.760 | Can people, like, do they have, do you guys have access to the Slack, AI engineer Slack?
00:29:56.760 | Okay.
00:29:57.760 | Good.
00:29:58.760 | Slack.
00:29:59.760 | There's two channels.
00:30:02.760 | Well, we're just giving it...
00:30:04.760 | Yeah.
00:30:05.760 | Oh, yeah.
00:30:06.760 | There's, like...
00:30:07.760 | Yeah.
00:30:08.760 | So, yeah, that's what happened when the Wi-Fi was down and we were trying to create it.
00:30:13.760 | All right.
00:30:14.760 | They'll take a question over here.
00:30:15.760 | On your badge, there should be the link to join the Slack group.
00:30:17.760 | All right.
00:30:18.760 | I'd say... I'm going to... I'll take that one and I'll... if I could...
00:30:22.760 | Let's just try... listen, and also try to get the repo.
00:30:24.760 | I'd like to walk through the quick start and then we can look at some examples.
00:30:29.760 | You know, it's something actually...
00:30:30.760 | Quinn has done a ton of work with this.
00:30:31.760 | I have not personally.
00:30:32.760 | I mean, there's obviously massive latency benefits because you cut out network round trips all over
00:30:36.760 | the place.
00:30:37.760 | so you save a ton.
00:30:38.760 | And depending on where you are in the world, that can save a lot.
00:30:39.760 | If you're US-based, you know, your network latency is going to be relatively low.
00:30:43.760 | But a lot of the developers in the community are in Europe and a lot of these AI services
00:30:48.760 | are relatively new with, you know, data centers only in the US.
00:30:52.760 | So, there are different challenges when doing that, though, when running it.
00:31:08.760 | Actually, there are great hosting providers like Modal that offer really good options for
00:31:12.760 | running, like, your own local LLM, which means you're, you know, buying, or I guess
00:31:17.760 | leasing, GPU time instead of running everything on your own GPU, which would probably be
00:31:22.760 | cost-prohibitive because of the way the processes run.
00:31:25.760 | But that's a...
00:31:26.760 | That is also a talk in itself.
00:31:28.760 | Next question.
00:31:29.760 | Sure.
00:31:30.760 | Oh, yeah.
00:31:46.760 | Definitely.
00:31:47.760 | Well, the large context windows definitely cause LLMs to...
00:31:58.760 | Yeah, I'm sorry.
00:31:59.760 | The question was around state management with LLMs and whether it's better to, I guess, chunk
00:32:08.760 | or have more kind of deterministic input, versus just a large context where you dump everything in.
00:32:14.760 | This is actually an extension of what the other gentleman was asking.
00:32:17.760 | The idea being that...
00:32:20.760 | And actually, at Daily we built the chat widget that's on the homepage of the voice AI
00:32:27.760 | World's Fair page.
00:32:28.760 | I personally built that.
00:32:31.760 | What was interesting there is that...
00:32:33.760 | And Alesh and I were just talking about this.
00:32:35.760 | Function calls in the context of real time are still slow, unfortunately.
00:32:39.760 | Like, too slow.
00:32:40.760 | Actually, Gemini...
00:32:42.760 | I'll give more props to Gemini being like maybe one of the fastest.
00:32:45.760 | If you run, like, a basic local demo, one of these demos, and you ask it what the weather is, it will return back with a time to first byte of, I don't know, less than 500 milliseconds.
00:32:54.760 | Whereas other vendors -- not trying to throw OpenAI under the bus, but it has gotten slower.
00:32:59.760 | Like the...
00:33:00.760 | You might see upwards of 1.5 to 2 seconds of waiting just to get that first token back.
00:33:05.760 | And we dug into...
00:33:06.760 | Actually, this is something from just this morning.
00:33:08.760 | The issue is when you get the normal streamed response for the conversation, you can start playing that audio out once you get the first sentence.
00:33:14.760 | The issue being when you get a tool, you need the entire JSON response before you can actually do anything with it.
00:33:19.760 | So that's slow.
00:33:21.760 | That's one part of it.
00:33:23.760 | Separately, chunking the prompts is absolutely the way to go.
00:33:26.760 | And building that world's fair bot, I was kind of balancing between the two worlds because you can talk to it and ask it about speakers for the session.
00:33:36.760 | Route 1 would have said, let's use, like, a RAG approach and put all of the speaker JSON in something that could be a tool that could be accessed.
00:33:45.760 | And I tried that and unfortunately, it's just a little too slow.
00:33:48.760 | It's a giant context.
00:33:49.760 | It takes a while to come back.
00:33:50.760 | What's interesting is if you instead move that all just directly into the context with Gemini Live.
00:33:56.760 | It's a little bit variable, but under good conditions, you'll get a response back on that like 800 millisecond latency.
00:34:03.760 | So it actually has access to like the full context.
00:34:05.760 | The one trade-off though is what this gentleman over here was asking is accuracy.
00:34:10.760 | It's going to get confused, especially when you give it JSON with a lot of speakers.
00:34:13.760 | And this isn't, this is all LLMs.
00:34:15.760 | It's not specifically Gemini.
00:34:16.760 | With that type of, even structured data becomes very hard to kind of discern what's what when a lot of it looks the same.
00:34:22.760 | So it's all, I mean, a lot of this is emerging like other things in AI, trying to just do it as fast as possible with voice.
00:34:29.760 | Before I take one more question, I want to check in: have folks had any luck with the repo? Got thumbs up.
00:34:36.760 | All right, I'm going to pause on questions for the time being.
00:34:38.760 | We can take them at the tail end.
00:34:40.760 | I do want to try to go through the quick start if we could.
00:34:43.760 | That would be great.
00:34:44.760 | And again, for now: we consolidated channels.
00:34:47.760 | There's one channel in the Slack.
00:34:50.760 | It's workshop, hyphen, voice, hyphen, Gemini, hyphen, pipecat.
00:34:55.760 | Quinn is in there answering questions; we're going to join, and we're going to share links there.
00:34:59.760 | So Mark, that's great.
00:35:00.760 | Okay, great.
00:35:01.760 | So I would recommend maybe we'll just take like a few minutes of independent getting set up.
00:35:08.760 | If you all go to the README in the gemini-pipecat-workshop repo.
00:35:13.760 | And if you'd roll through the first few steps, maybe get a, I don't know, a hand up.
00:35:19.760 | You could just flash it up real quick so I could see when people start to get through it.
00:35:22.760 | You don't have to hold it up.
00:35:23.760 | If you don't mind, if you brought a device.
00:35:26.760 | Okay, we've got one, two.
00:35:29.760 | So for this workshop, I have leaked my key, a key from one of my accounts, which I'll cycle
00:35:39.760 | after this.
00:35:40.760 | So you don't need to sign up for a daily account.
00:35:41.760 | You do need to sign up for a Gemini account.
00:35:44.760 | I don't have a key I can just give out.
00:35:47.760 | So in the env example file, you'll see there's already a Daily key, which you can use.
00:35:52.760 | Do people know how to sign up for a Gemini key?
00:35:55.760 | I have it in the read me.
00:35:56.760 | You have it in the read me.
00:35:57.760 | It's through AI Studio.
00:35:58.760 | Is that a good spot to do it?
00:35:59.760 | Okay.
00:36:01.760 | All right.
00:36:02.760 | I'll take a question while we're waiting.
00:36:16.760 | So in terms of the PipeCat interface -- that's a good question -- you interface
00:36:33.760 | with the PipeCat class.
00:36:35.760 | So there's a service class within PipeCat.
00:36:37.760 | So how you write your application code is going to be uniform across all of the services.
00:36:42.760 | The individual providers, unlike the text-based LLMs, haven't really settled
00:36:48.760 | on kind of a standard.
00:36:50.760 | So PipeCat handles all that translation on your behalf.
00:36:53.760 | And I think there are other frameworks that do similar things.
00:36:55.760 | The idea is to provide like a uniform, simple interface.
00:36:59.760 | So that you could take, and this is part of the modularity.
00:37:02.760 | If you wanted to, you could swap this bot out for OpenAI Realtime or, you know, a text-based
00:37:08.760 | model with a TTS and STT paired with it.
00:37:11.760 | That's kind of the whole idea.
00:37:13.760 | There is maybe one thing LLM providers differ on -- maybe, I don't know
00:37:18.760 | if anybody here can nudge anyone you know.
00:37:21.760 | The system instruction or system prompt is not uniformly dealt with.
00:37:25.760 | I don't know if that's well understood.
00:37:27.760 | But OpenAI has this system-role message that you can inject anywhere at any time,
00:37:32.760 | which is really fantastic.
00:37:34.760 | But Anthropic and Google require like a named system instruction that's a special one time,
00:37:40.760 | like at constructor time instruction.
00:37:44.760 | So there's some differences.
00:37:46.760 | As best we can, we unify.
00:37:49.760 | All right.
00:37:51.760 | Any -- how are folks doing on getting Quick Start going?
00:37:55.760 | Is that a question or -- okay, I'll take -- maybe it's related to the Quick Start.
00:38:05.760 | Oh, yeah.
00:38:07.760 | Great topic.
00:38:08.760 | Yeah, so there's a question about noisy environments, which is like voice AI's kryptonite.
00:38:20.760 | So with the VAD, no, not at the moment.
00:38:23.760 | But there are -- again, this is where like PipeCat is the assembler of all things.
00:38:28.760 | You want to plug in something that you can run separately from the VAD.
00:38:34.760 | We found Krisp; they're a partner of ours.
00:38:37.760 | They have a fantastic noise cancellation.
00:38:40.760 | You can run something outside of the loop that would actually clean up the audio.
00:38:44.760 | In PipeCat, it would be in the transport itself.
00:38:47.760 | So in that audio input, it would take the audio input and remove any ambient noise.
00:38:52.760 | So like chip bags opening or dogs barking.
00:38:54.760 | But maybe more impressively, human background voice.
00:38:57.760 | So it will remove that from the feed.
00:38:59.760 | So you could be in this conference and it picks up the primary speaker for the device like incredibly well.
00:39:04.760 | At the moment, they're the only ones that I'm aware that know how to do that and do it that well.
00:39:09.760 | But it's -- I mean, it's phenomenal.
00:39:10.760 | But you're right.
00:39:11.760 | The VAD is -- I mean, it was one of the biggest problems that we saw until we found Krisp, and they're fantastic.
00:39:17.760 | Is that C-R-I-S-P?
00:39:18.760 | K-R-I-S-P.
00:39:20.760 | So big props to the Krisp team.
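In PipeCat terms, the noise-cancellation filter plugs into the transport's audio input, roughly like this. PipeCat ships a Krisp filter, but it requires Krisp's SDK and model files, so treat this as a shape rather than a copy-paste recipe:

```python
from pipecat.audio.filters.krisp_filter import KrispFilter
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    audio_in_filter=KrispFilter(),  # strips ambient noise/background voices
)
```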
00:39:24.760 | Which speech-to-speech models do you guys use right now?
00:39:29.760 | We use all of them.
00:39:30.760 | PipeCat's open source.
00:39:31.760 | So we bring -- I think the options are Gemini Multimodal Live, OpenAI Realtime, and then AWS just launched a new one called Nova Sonic.
00:39:39.760 | So we have those three within PipeCat.
00:39:43.760 | They're all pretty -- they're all very much -- actually, well, they're all very similar.
00:39:55.760 | I'm not going to -- I don't want to, like, nitpick on all of the vendors here.
00:39:59.760 | But they're -- I mean, they're -- they have strengths and weaknesses each because it's still an emerging field.
00:40:03.760 | But they're -- latency-wise, that is not an issue.
00:40:08.760 | Latency is fantastic for all the providers.
00:40:10.760 | Yeah.
00:40:11.760 | All right.
00:40:14.760 | Okay.
00:40:15.760 | Maybe Alesh will walk through a quick start.
00:40:19.760 | Why don't we -- he's going to just do some live coding here, and maybe this will help to understand how this all works.
00:40:24.760 | And then perhaps we -- I stop chatting, and maybe you could grab me if you have questions and just do some heads-down working time.
00:40:31.760 | Okay.
00:40:32.760 | Yeah, I cannot type, and --
00:40:37.760 | Teamwork.
00:40:37.760 | Teamwork.
00:40:38.760 | Teamwork.
00:40:39.760 | Yeah.
00:40:40.760 | I'll try from the very, very -- from nothing, from scratch.
00:40:45.760 | So this is a Python project.
00:40:48.760 | Actually, let me -- let me try again.
00:40:50.760 | So, yeah.
00:40:51.760 | This is a Python project.
00:40:52.760 | The first thing we'll do is create an environment, a virtual environment as it's called.
00:40:58.760 | I like to call it .venv.
00:41:00.760 | That's how you create a virtual environment in Python.
00:41:04.760 | A virtual environment will have --
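The commands behind this step are presumably the usual ones (the requirements file name is an assumption about the workshop repo):

```bash
python3 -m venv .venv            # create the virtual environment
source .venv/bin/activate        # activate it (macOS/Linux)
pip install -r requirements.txt  # install the project's dependencies
```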
00:41:08.760 | Oh, yeah.
00:41:09.760 | What's that?
00:41:10.760 | Your screen is hard to see.
00:41:11.760 | Oh, it's hard to see.
00:41:12.760 | How do I do that?
00:41:13.760 | Can you change your theme to, like, white?
00:41:14.760 | I guess.
00:45:37.760 | Can you change your theme to, like, white?
00:45:38.760 | Can you change your theme to, like, white?
00:45:39.760 | Can you change your theme to, like, white?
00:45:40.760 | Can you change your theme to, like, white?
00:45:41.760 | Can you change your theme to, like, white?
00:45:42.760 | Can you change your theme to, like, white?
00:45:43.760 | Can you change your theme to, like, white?
00:45:44.760 | Can you change your theme to, like, white?
00:45:45.760 | Can you change your theme to, like, white?
00:45:46.760 | Can you change your theme to, like, white?
00:45:47.760 | Can you change your theme to, like, white?
00:45:48.760 | Can you change your theme to, like, white?
00:45:49.760 | Can you change your theme to, like, white?
00:45:50.760 | Can you change your theme to, like, white?
00:45:51.760 | Can you change your theme to, like, white?
00:45:52.760 | Can you change your theme to, like, white?
00:45:53.760 | Can you change your theme to, like, white?
00:45:54.760 | Can you change your theme to, like, white?
00:45:55.760 | Can you change your theme to, like, white?
00:45:56.760 | Can you change your theme to, like, white?
00:45:57.760 | Can you change your theme to, like, white?
00:45:58.760 | Can you change your theme to, like, white?
00:45:59.760 | Can you change your theme to, like, white?
00:46:00.760 | Can you change your theme to, like, white?
00:46:01.760 | Can you change your theme to, like, white?
00:46:02.760 | Can you change your theme to, like, white?
00:46:03.760 | Can you change your theme to, like, white?
00:46:04.760 | Can you change your theme to, like, white?
00:46:05.760 | Can you change your theme to, like, white?
00:46:06.760 | Can you change your theme to, like, white?
00:46:07.760 | Can you change your theme to, like, white?
00:46:08.760 | Can you change your theme to, like, white?
00:46:09.760 | Can you change something?
00:46:10.760 | Oh, yeah.
00:46:11.760 | And we also need a VAD analyzer, which is gonna be a Silero VAD analyzer.
00:46:19.760 | So the transport is gonna be able to use this VAD analyzer to detect whether the user has spoken or not.
00:46:28.760 | What's that?
00:46:29.760 | Oh, yeah.
00:46:31.760 | And this is gonna be AI engineer.
00:46:34.760 | There we go.
00:46:36.760 | Okay, now we have the transport.
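For readers following along, here's roughly what that transport setup looks like in code. A minimal sketch, not the exact code from the session: import paths and DailyParams fields shift a bit between PipeCat versions, and the room URL env var name is an assumption.

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Daily transport: streams audio in and out of a Daily room. Silero VAD
# watches the incoming audio to detect when the user starts/stops speaking.
transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),  # assumed env var holding the room URL
    None,                         # optional Daily meeting token
    "AI Engineer",                # the bot's display name in the room
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```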
00:46:38.760 | Now we're gonna create the LLM.
00:46:41.760 | In this case, it's Gemini Live, so I don't need to create a speech-to-text or text-to-speech.
00:46:49.760 | We just create the Gemini Live.
00:46:52.760 | And for that, I'll need to copy it, 'cause I don't know that from memory.
00:46:56.760 | But, oops.
00:46:59.760 | That's in this file.
00:47:05.760 | GeminiBot.
00:47:06.760 | I just wanna copy these lines here.
00:47:09.760 | There we go.
00:47:10.760 | All right.
00:47:11.760 | So this is my LLM.
00:47:15.760 | And, again, it uses this Gemini Multimodal Live LLM service.
00:47:24.760 | That's gonna add my import.
00:47:26.760 | Okay.
00:47:27.760 | And now it needs a couple of things.
00:47:29.760 | The system instruction, which is like what the agent is gonna do, and some tools.
00:47:35.760 | The tools, I'm gonna skip them for now.
00:47:38.760 | So let's do the system instruction.
00:47:41.760 | Again, I'm gonna copy it from somewhere.
00:47:43.760 | There it is.
00:47:44.760 | This is like the prompt.
00:47:46.760 | Like the main prompt of the agent.
00:47:51.760 | All right.
00:47:52.760 | So the system instruction is you're a helpful assistant who can answer questions and use tools.
00:47:57.760 | For now, we're not gonna use any tools.
00:47:59.760 | You know what?
00:48:00.760 | Let me get rid of the tools.
00:48:04.760 | Just copy here for later.
00:48:07.760 | I'm gonna comment this out.
00:48:10.760 | And you are just the helpful assistant.
00:48:12.760 | Okay.
00:48:13.760 | So that's...
00:48:16.760 | All right.
00:48:19.760 | So no complaints here.
00:48:21.760 | All right.
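As a sketch, the Gemini Live service plus system instruction looks something like this. Class and module names match recent PipeCat releases, but verify against the repo; the API key env var name is an assumption.

```python
import os

from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

# Speech-to-speech service: audio in, audio out. No separate STT or TTS.
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),  # assumed env var name
    system_instruction="You are a helpful assistant who can answer questions.",
    # tools=[...],  # commented out for now, just like in the demo
)
```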
00:48:22.760 | And now we just create the pipeline.
00:48:26.760 | I'm gonna avoid storing the context because I don't think we need it for now.
00:48:34.760 | And this is the pipeline.
00:48:35.760 | The pipeline just receives a list of processors or elements.
00:48:41.760 | And the first one is a transport.input.
00:48:44.760 | That's the input transport.
00:48:46.760 | So how we get audio from the daily room in this case.
00:48:51.760 | The LLM.
00:48:52.760 | And the transport.output.
00:48:55.760 | All right.
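So the whole pipeline, as a sketch, is just those three processors in order:

```python
from pipecat.pipeline.pipeline import Pipeline

# Frames flow through the list in order: audio comes in from the Daily
# room, the speech-to-speech LLM turns it into reply audio, and the
# output transport sends that audio back to the room.
pipeline = Pipeline([
    transport.input(),   # audio from the Daily room
    llm,                 # Gemini Live: STT + LLM + TTS in one service
    transport.output(),  # bot audio back to the room
])
```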
00:49:00.760 | Now we need...
00:49:01.760 | This just defines it.
00:49:02.760 | A pipeline is itself also another processor.
00:49:04.760 | So you could build a pipeline of pipelines of pipelines of pipelines.
00:49:09.760 | So you can plug and play the pieces the way you like.
00:49:13.760 | So how do you run a pipeline?
00:49:15.760 | You need a task.
00:49:16.760 | What we call a pipeline task.
00:49:18.760 | That receives a pipeline.
00:49:21.760 | And the pipeline task also has some params.
00:49:25.760 | Which are called pipeline params.
00:49:28.760 | And we're gonna say...
00:49:33.760 | Oops.
00:49:34.760 | That we allow interruptions.
00:49:37.760 | And I think that's enough.
00:49:38.760 | And how do you run a task?
00:49:46.760 | You can create more than one pipeline task if you wanted.
00:49:49.760 | In this case we just have one.
00:49:50.760 | Usually you just have one.
00:49:52.760 | You're gonna create a runner.
00:49:55.760 | And guess what?
00:49:56.760 | It's called pipeline runner.
00:49:58.760 | It's a pipeline runner.
00:49:59.760 | And then we just do...
00:50:01.760 | await runner.run(task).
00:50:05.760 | And some completions, please.
00:50:07.760 | PipelineRunner.
00:50:08.760 | All right.
00:50:09.760 | And I think that's it.
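Tying the task and the runner together, continuing the sketch above (PipelineParams takes more options than this; allow_interruptions is the only one set here):

```python
import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

async def main():
    # Wrap the pipeline in a task; allowing interruptions lets the user
    # barge in while the bot is still talking.
    task = PipelineTask(
        pipeline,
        params=PipelineParams(allow_interruptions=True),
    )

    # The runner drives the task until the session ends.
    runner = PipelineRunner()
    await runner.run(task)

asyncio.run(main())
```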
00:50:10.760 | We'll try it.
00:50:13.760 | Import OS.
00:50:14.760 | I think.
00:50:15.760 | I think there's no more warnings.
00:50:29.760 | And I need to load the environment variables.
00:50:32.760 | Which is this line here.
00:50:34.760 | load_dotenv().
00:50:35.760 | I'm just copying it from another file.
00:50:38.760 | Okay.
00:50:43.760 | And where do we get load_dotenv?
00:50:44.760 | This is just a function that loads the environment variables from a .env file.
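Assuming the standard python-dotenv package, that's just:

```python
import os

from dotenv import load_dotenv

# Reads key=value pairs from a local .env file (e.g. GOOGLE_API_KEY)
# into the process environment so os.getenv() can see them.
load_dotenv(override=True)

print(os.getenv("GOOGLE_API_KEY") is not None)  # sanity check
```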
00:50:54.760 | All right.
00:50:55.760 | And yeah.
00:50:57.760 | Let's try.
00:50:58.760 | I'm just gonna open the, oops.
00:51:03.760 | I'm just gonna open the terminal here.
00:51:08.760 | And I'm just gonna run it.
00:51:10.760 | I think we call it bot.py.
00:51:16.760 | No module.
00:51:16.760 | Maybe I need to install the requirements.
00:51:19.760 | I forgot this step.
00:51:20.760 | There it is.
00:51:27.760 | Okay.
00:51:28.760 | So at the beginning I wrote that file, requirements.txt.
00:51:33.760 | Which has, well, just a few requirements.
00:51:38.760 | But I forgot to install them.
00:51:49.760 | In the meantime, I'm just gonna go to the daily room that I just pointed the bot to.
00:51:57.760 | No video.
00:52:09.760 | Okay.
00:52:10.760 | So that's, right now it's just me in that room.
00:52:16.760 | So, and now we just have to wait for this to, to finish.
00:52:21.760 | And hopefully the bot will join the room and we'll be able to talk to it.
00:52:26.760 | Hopefully.
00:52:27.760 | Yeah.
00:52:32.760 | Does it force, like, Daily as part of the platform?
00:52:36.760 | Like that room.
00:52:38.760 | Like, is it interchangeable?
00:52:40.760 | It is, yeah. This is because we're using the Daily transport.
00:52:45.760 | And the Daily transport just connects to a Daily room.
00:52:48.760 | But you could have a WebSocket transport.
00:52:51.760 | And then use Twilio with a phone number.
00:52:54.760 | And Twilio being connected to that.
00:52:56.760 | If we have time, we can even try that.
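The point is that only the transport changes; the rest of the pipeline stays the same. Very roughly, the Twilio path looks like the sketch below. This is heavily simplified, with names from recent PipeCat versions taken as assumptions; the real setup also needs a FastAPI websocket endpoint and Twilio's stream SID from the handshake.

```python
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

def make_twilio_transport(websocket, stream_sid: str):
    # Same role as DailyTransport, but audio flows over a Twilio Media
    # Streams websocket instead of a Daily room. The serializer converts
    # between Twilio's message format and PipeCat frames.
    return FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )
```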
00:52:58.760 | Um, so I think.
00:53:00.760 | You wanna, you wanna talk?
00:53:07.760 | Yeah.
00:53:08.760 | Yeah.
00:53:09.760 | Yeah.
00:53:10.760 | I'll wait.
00:53:11.760 | So in PipeCat we've also added, based off of the aiortc Python package,
00:53:17.760 | which is a WebRTC package in Python,
00:53:20.760 | a new transport called the SmallWebRTCTransport.
00:53:24.760 | It is peer-to-peer WebRTC communication that's free.
00:53:27.760 | So it's separate from any vendor.
00:53:29.760 | Though the one downside is that it requires a TURN server, which you bring your own.
00:53:35.760 | So we weren't prepared for that for the conference.
00:53:38.760 | And also just the conference Wi-Fi makes that a little challenging.
00:53:40.760 | But normally if you're running any of the, we call them foundational examples,
00:53:45.760 | in PipeCat.
00:53:46.760 | Think of them as the essential examples that show how to do very specific functions.
00:53:52.760 | There's probably about a hundred of them in PipeCat, but one by one, it shows you how
00:53:55.760 | to, like, record or add an STT or push frames or show images or sync images and sound.
00:54:02.760 | Those all use the peer-to-peer WebRTC transport.
00:54:05.760 | So we would have loved to have used that.
00:54:07.760 | You wouldn't need a key, but unfortunately firewall rules have trumped that.
00:54:13.760 | All right.
00:54:14.760 | Uh, so I'm just running the bot and see how it fails because it has to fail the first time.
00:54:26.760 | Yeah.
00:54:27.760 | This is the messy part.
00:54:28.760 | 'Cause there's a bunch of Python packages and Python just decides to take its time to load
00:54:36.760 | them.
00:54:37.760 | But you see how easy it was to write, um, like an agent, like a voice agent with Gemini
00:54:46.760 | Live.
00:54:47.760 | Uh, if it worked, it'd be just a few lines of code, uh, that we wrote in, I don't know how
00:54:53.760 | long it took me, but, uh, maybe like five, 10 minutes.
00:54:57.760 | Um, yeah.
00:54:58.760 | Are there any questions on the example or what's that?
00:55:04.760 | It worked.
00:55:05.760 | It worked?
00:55:06.760 | Yeah.
00:55:07.760 | What worked?
00:55:08.760 | The bot worked?
00:55:09.760 | Yeah.
00:55:10.760 | Okay.
00:55:11.760 | All righty.
00:55:12.760 | All right.
00:55:13.760 | Nice.
00:55:14.760 | Yeah.
00:55:15.760 | There's a couple of things that popped up there with the words later in it.
00:55:29.760 | We have, I don't know the actual number of customers, but I mean, PipeCat probably serves hundreds
00:55:38.760 | of thousands of calls a day.
00:55:39.760 | I don't know.
00:55:40.760 | Well, a lot.
00:55:41.760 | Quinn, you probably have a better idea.
00:55:42.760 | Oh yeah.
00:55:43.760 | I mean, PipeCat is used by some very large companies in production, uh, and people are
00:55:48.760 | contributing to it from NVIDIA, AWS, uh, OpenAI, Google, uh, lots of big companies.
00:55:54.760 | Yeah.
00:55:55.760 | There's one thing we didn't mention about PipeCat, which is that what you see now on the screen,
00:56:01.760 | this runs on the server side, but we do have client SDKs for Android, iOS, JavaScript,
00:56:08.760 | and React, and I think that's about it, but, uh, even a C++ client, if you want.
00:56:20.760 | Um, so yeah.
00:56:21.760 | So that's the server side, but you can plug your, your client and connect to the, to the agent
00:56:27.760 | on your, on your phone.
00:56:29.760 | Uh, that depends on the, uh, transport you use, but yeah, you could
00:56:36.760 | have your client connect to a Daily, or we support LiveKit as well, but to a Daily
00:56:42.760 | room. We like Daily because we work at Daily, but, uh, you can connect
00:56:47.760 | to a Daily room and then the bot or the agent would connect to the Daily room as
00:56:52.760 | well.
00:56:53.760 | And then that's the transport, the WebRTC transport.
00:56:55.760 | Yeah.
00:56:56.760 | I think there were questions there.
00:57:00.760 | I can see.
00:57:01.760 | Uh, say it again.
00:57:18.760 | Um, yes, yes, there is, um, actually, uh, for the previous version, um, I just hacked
00:57:28.760 | together a thing called, uh, that I call release evals, which is a bot talking to a bot.
00:57:33.760 | And what it does is I put this bot, uh, up and then it joins a daily room and then I have,
00:57:40.760 | uh, an eval bot, and the eval bot, um, what it's going to do, it has a prompt, which is to ask a
00:57:47.760 | simple addition question.
00:57:48.760 | Okay.
00:57:49.760 | And then that eval bot is going to connect to the room.
00:57:52.760 | It's going to ask, what is two plus two?
00:57:54.760 | And then the other bot is going to reply two plus two is four.
00:57:58.760 | And the eval bot, the LLM, uh, it checks if the answer of the user is correct.
00:58:06.760 | And the user in this case is another LLM.
00:58:08.760 | So it verifies, it's like an end to end.
00:58:11.760 | The good thing is we, we run the, we used to run like more than a hundred examples every
00:58:17.760 | release just to make sure they work.
00:58:19.760 | So I just got tired of it cause it's very painful and very slow.
00:58:23.760 | So we have these, uh, eval bots or, uh, release evals that are gonna test each service.
00:58:31.760 | Like we test Gemini Live, we test Cartesia, Deepgram, like all the services, like, end to end.
00:58:38.760 | And then the bots basically talk to each other with voice.
00:58:41.760 | So that's the, that's the, the nice thing.
00:58:43.760 | So yeah.
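As a rough illustration of the release-evals idea (the prompt and the toy judge here are hypothetical; the real harness lives in the PipeCat repo):

```python
# The eval bot joins the room, plays the user, and judges the reply.
EVAL_PROMPT = (
    "You are testing a voice agent. Ask it: 'What is two plus two?' "
    "Then check its reply and report PASS if the answer is correct, "
    "FAIL otherwise."
)

def judge(reply: str) -> bool:
    """Toy stand-in for the LLM judge: did the bot answer four?"""
    text = reply.lower()
    return "four" in text or "4" in text

assert judge("Two plus two is four.")
assert not judge("I'm not sure.")
```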
00:58:44.760 | Okay.
00:58:45.760 | Maybe.
00:58:46.760 | Oh, is this on?
00:58:47.760 | Maybe real quick just to show for those that didn't see it, it is.
00:58:49.760 | Hey, can you tell me a joke?
00:58:54.760 | Why don't scientists trust atoms?
00:58:56.760 | Because they make up everything.
00:58:58.760 | Which, if you build PipeCat, I've heard that joke probably, I mean, like 5,000 times.
00:59:06.760 | Like you can even try to seed it with something different, but it's still-
00:59:09.760 | I will try to come up with a new joke.
00:59:10.760 | Yeah.
00:59:11.760 | Why do, why do you always say the same joke?
00:59:16.760 | That's an interesting question.
00:59:18.760 | Why do you think people keep telling the same jokes?
00:59:21.760 | I don't know.
00:59:22.760 | You tell me.
00:59:27.760 | From my search, there appear to be several reasons why people repeat jokes.
00:59:31.760 | Enjoyment and reinforcement.
00:59:33.760 | People repeat things they find funny because they want to re-experience the good feeling associated
00:59:37.760 | with laughter.
00:59:38.760 | It feels good.
00:59:39.760 | And repeating it is a way to try and recreate that sensation.
00:59:43.760 | Memory aid.
00:59:44.760 | Okay, okay, that's enough.
00:59:45.760 | Repeating something can help you remember.
00:59:46.760 | Yeah.
00:59:47.760 | Is there anything else I can help you with today?
00:59:50.760 | No, thank you.
00:59:51.760 | That was it.
00:59:52.760 | Thanks.
00:59:53.760 | Thank you so much.
00:59:54.760 | Well, just to show there was a question about interruptions.
00:59:55.760 | We could just have it, like, my favorite is to ask it to tell you like a really long story
00:59:59.760 | and then interrupt it.
01:00:00.760 | So can you tell me a really long story?
01:00:02.760 | Okay, I can do that.
01:00:07.760 | Here's a story.
01:00:08.760 | And feel free to interrupt whenever you like.
01:00:10.760 | Once upon a time.
01:00:11.760 | Okay, actually tell me that new joke.
01:00:12.760 | Hey, tell me that new joke.
01:00:12.760 | Why don't scientists trust atoms?
01:00:13.760 | Because they make up everything.
01:00:14.760 | Okay, here's one.
01:00:15.760 | That was the same joke.
01:00:16.760 | That was the same joke.
01:00:17.760 | All right.
01:00:18.760 | All right.
01:00:19.760 | That was it.
01:00:21.760 | Well, I find, like, in a lot of conversations, it's like it jumps in based on when you pause
01:00:38.760 | talking.
01:00:40.760 | So I think the questions asked during this workshop could map out, like, years of work.
01:00:57.760 | So this is, like, another one of those fantastic cutting edge things.
01:01:01.760 | So, again, back to, like, human evolution, we all know, when we talk, actually, it's
01:01:06.760 | even hard for humans to not speak over each other.
01:01:09.760 | So the way that it works mechanically is when the user stops speaking, the VAD has a timeout.
01:01:14.760 | You tell it and program it, wait, let's say, one second, 0.8 seconds, half second, whatever
01:01:19.760 | feels natural.
01:01:20.760 | And you're trying to balance low latency response with giving the user enough time to speak.
01:01:25.760 | It's a really hard thing.
01:01:26.760 | And it's one of the biggest complaints is that agents will speak over the human.
01:01:30.760 | So if you're, let's say you're building an interview bot, like, you're using, like, Tavis, one of their
01:01:35.760 | digital twins, you want to have, like, a real, like, likeness, and you want to speak to it.
01:01:40.760 | You may take time to think, because sometimes you have to take time to think.
01:01:44.760 | And that's a really difficult thing for bots to do, because, again, it's driven by, like, a simple stop-speaking
01:01:49.760 | algorithm.
01:01:50.760 | So this is a new, like I said, it's like an emerging field of models, which is looking at semantic end of turn.
01:01:58.760 | So driven off of things like speech filler words, pauses, intonation, so things in the audio realm, and also things in the text-based realm, so just looking at context.
01:02:11.760 | So we've actually started, we're one of many that are doing this, I think, but we launched a model.
01:02:17.760 | If you look at it on GitHub, it's under smart-turn.
01:02:20.760 | It's a native audio-in classifier that runs an inference on the input audio, and it simply outputs either complete or incomplete.
01:02:30.760 | And the way PipeCat uses this is that if you get an incomplete response, we can dynamically adjust the VAD timeout.
01:02:36.760 | So we can tell the PipeCat bot, okay, he or she is not done speaking.
01:02:41.760 | Let's actually give, let's say, three seconds to complete the thought.
01:02:45.760 | And if they still haven't continued by then, the bot will actually respond.
01:02:47.760 | So you can create a little bit of, like, dynamic interaction there.
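Conceptually, the logic is something like the sketch below. This is an illustration of the idea, not PipeCat's actual implementation, and all the names are hypothetical: the smart-turn classifier labels the audio at each pause as a complete or incomplete turn, and an incomplete turn stretches the silence timeout.

```python
DEFAULT_STOP_SECS = 0.8   # normal VAD silence timeout: respond quickly
EXTENDED_STOP_SECS = 3.0  # grace period when the thought sounds unfinished

def next_vad_timeout(turn_complete: bool) -> float:
    """Pick the silence timeout to apply after the user pauses.

    turn_complete is the smart-turn classifier's binary output for the
    audio heard so far (filler words and trailing intonation tend to
    push it toward "incomplete").
    """
    if turn_complete:
        return DEFAULT_STOP_SECS   # turn sounds finished: low latency
    return EXTENDED_STOP_SECS      # give the user time to finish the thought
```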
01:02:51.760 | And that's one of the first things.
01:02:52.760 | I'm sure the Google team is working on similar things.
01:02:55.760 | I know OpenAI is, and all the STT vendors are also looking at their own things.
01:03:00.760 | So I'd say, right now, it is very much an unsolved problem.
01:03:03.760 | But I would imagine, given how fast things are going in the next 12 months, we'll have great solutions that will make it even more natural to talk to a bot.
01:03:11.760 | It's a good question.
01:03:16.760 | Any more questions?
01:03:21.760 | Well, actually, I was kind of wondering, like, is there a way to see the transcripts of that happening?
01:03:31.760 | Oh, yeah.
01:03:32.760 | Yeah, yeah.
01:03:34.760 | One, this is actually back to the, well, for, this is specific to PipeCat, but also, like, Gemini Live will output audio and text.
01:03:42.760 | And other speech-to-speech LLMs do this.
01:03:45.760 | PipeCat offers, in terms of its, again, orchestration role... actually, it's going to be specific to the TTS provider.
01:03:55.760 | Many, there are great TTS providers that do word and timestamp synchronization.
01:03:59.760 | So they'll give pairs.
01:04:00.760 | They call them, like, alignment pairs.
01:04:02.760 | So if you're using a Cartesia or an Eleven Labs or Rime, they all output these pairs.
01:04:08.760 | One of the really cool things with PipeCat is that the TTS services output not only the audio stream, but also the text stream.
01:04:15.760 | So they'll output text frames, TTS text frames, we call them in PipeCat.
01:04:19.760 | And if you place, we have, in terms of how the client software works, there is, like, an observer role, where you can actually watch.
01:04:27.760 | There's a process that can watch things that happen in the pipeline and emit events.
01:04:30.760 | So we've instrumented that for the clients so that whenever you see those text frames move through the transport, you can get synchronized word and audio output.
01:04:39.760 | So in your client, if you wanted to have word-by-word output synchronized to the audio, you can do that with PipeCat.
01:04:44.760 | And it's as simple as just adding an event.
01:04:47.760 | I think you listen to, like, bot TTS text output or on bot TTS text, and it will give you the synchronized output.
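On the server side, one way to watch those frames go by is a small frame processor dropped into the pipeline. A sketch: TTSTextFrame is a real PipeCat frame type, but treat the exact import paths and base-class contract as assumptions for your version.

```python
from pipecat.frames.frames import TTSTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    """Logs the bot's speech text as TTS text frames flow through."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSTextFrame):
            print(f"bot said: {frame.text}")
        # Always pass the frame along so the pipeline keeps flowing.
        await self.push_frame(frame, direction)

# Placed between the LLM and the output transport, it sees the bot's words:
#   Pipeline([transport.input(), llm, TranscriptLogger(), transport.output()])
```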
01:04:53.760 | If I wanted to build a fully offline voice engine, what box in that pipeline would be hardest to do?
01:05:05.760 | Fully offline? Well, they're all doable. They're great models.
01:05:08.760 | I think it really depends on what your bot needs to accomplish.
01:05:11.760 | A lot of the state-of-the-art models, to do all the best and smartest things, need to have some real compute.
01:05:16.760 | Like, they're going to be run, like, on-prem or in the cloud.
01:05:20.760 | But if you have -- and a lot of bots do jobs.
01:05:23.760 | Like, if you wanted to build, like, a restaurant reservation one, like I referenced earlier, it's a very simple job.
01:05:28.760 | You could probably run it with some version of Llama running locally.
01:05:32.760 | There are great local -- something, again, Quinn has been experimenting with -- a lot of great local models.
01:05:39.760 | Like, Whisper has challenges, you know. I mean, it has some challenges as an open-source model for STT.
01:05:50.760 | But there are good and emerging TTS services. So, you know, I -- things are only as good as the input.
01:05:56.760 | And we've actually seen this with some of the speech-to-speech models that sometimes they mistranscribe.
01:06:00.760 | So, you really need -- I mean, it's -- you know, every part is critical, but if you can't transcribe the speech really well, nothing really matters.
01:06:06.760 | Like, it has to understand you. And having, like, disfluencies or, like, hallucinated responses or even just inaccurate responses kind of breaks everything down.
01:06:14.760 | So, things mostly start at the STT. So, maybe that's the hardest. I don't know if there are a lot of good open source options for that right now.
01:06:20.760 | I don't know. We're not doing anything in the STT world. No. That's a whole different ballgame. Good question, though.
01:06:31.760 | Just made me realize we could have used local models and avoided this.
01:06:36.760 | We could have. Yeah. Well, we're partnering with the Google team.
01:06:40.760 | Yeah. Yeah. As a -- yeah, exactly. Yeah.
01:06:44.760 | Has anyone looked at any of the sample projects and had questions? There's a lot of interesting things that we're doing.
01:06:49.760 | There's a lot of interesting things there. If any of this has, like, interested you, we do have a Discord.
01:06:53.760 | You're welcome to get on it. You can find us at pipecat.ai and find our Discord there.
01:06:59.760 | You can ask questions. There's some really cool stuff with Gemini that can be done.
01:07:04.760 | There -- in particular, in the PipeCat repo, we built -- I don't know if you know the game Catch Phrase,
01:07:09.760 | where you describe a word and something, you know, guesses it. We built a version of that.
01:07:14.760 | We had to brand it something else called Word Wrangler. And you, as the human, you describe a word,
01:07:18.760 | and then you have the AI agent try to answer it. So we built a client-server version of that,
01:07:23.760 | which I linked to in the repo. And then we have one that's a phone-based one that's, I think,
01:07:27.760 | particularly sophisticated and interesting. So you might think, like, how the hell would I build this with a speech-to-speech model?
01:07:33.760 | We actually use two Gemini agents in the same call, and we use a parallel pipeline where one agent is the host giving out the questions to the human user.
01:07:44.760 | The other is the guesser. And we kind of limit the audio flow so that the guesser, the AI player, can only hear the user.
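Structurally, that's PipeCat's ParallelPipeline. A conceptual sketch, with host_llm and guesser_llm as hypothetical stand-ins for two configured Gemini Live services; the real Word Wrangler example in the repo also gates which audio reaches which branch, omitted here.

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline

# Two agents share one call: each list is an independent branch that
# receives the input frames and contributes its own output frames.
pipeline = Pipeline([
    transport.input(),
    ParallelPipeline(
        [host_llm],     # the host: reads out words to the human player
        [guesser_llm],  # the guesser: hears the human and tries to answer
    ),
    transport.output(),
])
```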
01:07:51.760 | So there's a bunch of really interesting things getting into, like, majorly into the weeds of some of the powers of Pipecat.
01:07:57.760 | But it also speaks to the strength of having just native audio input being really, really helpful.
01:08:01.760 | So I'd recommend checking those out. Really cool, easy demos to run.
01:08:06.760 | One's Twilio. The other is, again, a client-server. I think it's, like, a React Next.js project.
01:08:15.760 | What's that? Word Wrangler? Yeah, I mean, we could run the Word Wrangler client app.
01:08:22.760 | It's actually just on the web. Test.
01:08:27.760 | Welcome to Word Wrangler. I'll try to guess the words you describe.
01:08:32.760 | Remember, don't say any part of the word itself. Ready? Let's go.
01:08:37.760 | I'm going to skip to something easier. Okay, this is something you take pictures with. It's on your phone.
01:08:44.760 | Is it camera?
01:08:48.760 | All right, this is a field related to the study of languages, I think.
01:08:53.760 | Is it linguistics?
01:08:56.760 | All right, this is a game of the yellow ball you play with rackets. Hit the ball over the net.
01:09:00.760 | Is it tennis?
01:09:04.760 | All right, this is a round dessert with chocolate chips sometimes and other fun goodies.
01:09:10.760 | Is it cookie?
01:09:13.760 | It's really good even when I'm bad at giving answers.
01:09:16.760 | So, pretty cool. This is built with Gemini Live.
01:09:19.760 | But, again, just an example of things you can build with voice AI.
01:09:24.760 | So, cool, unique interactions.
01:09:27.760 | All right. I think that's about it. Thanks, everybody.