
Milliseconds to Magic: Real‑Time Workflows using the Gemini Live API and Pipecat


Whisper Transcript

00:00:00.000 | Thao Yeager:
00:00:01.000 | So voice is the most natural of interfaces.
00:00:20.400 | Humans are storytellers, talkers, listeners, conversationalists.
00:00:27.520 | We think out loud.
00:00:29.660 | We learn to talk before we learn to read.
00:00:33.420 | And most of us talk faster than we type.
00:00:37.920 | We express emotion through our voices,
00:00:40.840 | and we use sound to understand the world around us.
00:00:46.460 | We've been working together for the past few months,
00:00:48.960 | Shrestha from the angle of models and APIs,
00:00:52.140 | and me from the application layer and agent framework direction.
00:00:55.320 | And I think we both believe that voice
00:00:57.180 | is a critical and universal building block for the whole next generation
00:01:00.940 | of Gen AI, especially at the UI level, but more generally as well.
00:01:04.700 | Those of us who are early adopters of personal voice AI talk to our computers all the time.
00:01:09.460 | We think of the LLMs we talk to as sounding boards and coaches
00:01:12.960 | and interfaces to everything that lives on our devices and in the cloud.
00:01:17.880 | And this is not just an early adopter phenomenon.
00:01:21.640 | Like, we already have voice agents deployed at scale.
00:01:26.340 | Language translation apps that translate between a patient and a doctor.
00:01:30.880 | Directed learning apps that a fourth grader can use to learn a topic they want to.
00:01:36.320 | Speech therapy apps and co-pilots that help people navigate complex enterprise software.
00:01:43.080 | One of the things we see in our work with customers at Daily is it's pretty common for people not to realize that they're talking to a voice agent on a phone call,
00:01:52.080 | even when you tell them at the beginning of the phone call that they're talking to an AI.
00:01:57.080 | Yeah, and kids born today will probably take all of this for granted.
00:02:00.920 | But those of us who are living through this evolution of talking computers, this can sometimes feel like magic.
00:02:08.040 | But of course, anybody who's seen a really great magician prepare a magic trick knows that the magic is just the interface.
00:02:17.640 | There's a lot of hard work that goes into creating that magic trick.
00:02:24.760 | So here's a partial list of the hard things that, done right, collectively add up to that magic.
00:02:29.160 | So from real-time responsiveness, which everyone in this track has talked about all day as the foundational thing you have to get right or voice AI is unworkable,
00:02:37.000 | through the things that we're just starting to experiment with, like generating dynamic user interface elements for every conversational turn.
00:02:45.400 | These are the things we've been hacking on and thinking about together for the past few months.
00:02:49.600 | And we're not going to go over all of these today, although we did have a little extra time in the session, right, Thor?
00:02:54.400 | Thor said we could talk for like a couple of hours, maybe, but we do have a framework that we thought would be useful to share with you,
00:03:02.000 | a framework that sort of maps onto how we've worked together from the model layer all the way up.
00:03:06.960 | Yeah, and this barely scratches the surface, but here are the layers of the voice AI stack.
00:03:12.880 | So at the bottom, underpinning everything, you have the large language models that frontier labs like DeepMind work on.
00:03:20.480 | Then above that, you have carefully designed, but at this stage, constantly evolving, real-time APIs.
00:03:28.080 | Google's version is called the Gemini Live API.
00:03:31.520 | Above the APIs are the orchestration libraries and frameworks, like Pipecat, that help manage and abstract the complexity of building these real-time multimodal applications.
00:03:46.080 | And then, of course, at the top of the stack, you have the application code.
00:03:51.040 | For each of the hard things we listed on the previous slide, the code that implements that hard thing lives somewhere in that stack.
00:03:57.680 | So one of the ways we think about this is that there's a map, and you can sort of think about it two-dimensionally, maybe.
00:04:03.760 | There's: where does the code live that solves the hard problem that you're thinking about as a voice agent developer?
00:04:10.480 | Where in the stack?
00:04:12.000 | And then how mature is our solution to that right now?
00:04:16.160 | Yeah, basically how solved is this thing?
00:04:19.520 | And what we've tried to do here is map all of these various things that you need to get right on a right-to-left axis of maturity.
00:04:28.480 | And there are a couple of things that are kind of top of mind for me about this mapping.
00:04:32.320 | One is that I don't think of any of these things as more than about 50% solved.
00:04:36.240 | Totally arbitrary, like personal thing.
00:04:38.960 | Shrestha and I just argued about it a little bit.
00:04:40.560 | Like what's the right way to represent that on this slide?
00:04:43.360 | But what we're trying to say is basically it's early.
00:04:45.680 | It's early for voice AI.
00:04:46.960 | And there's a lot of work to do at every part of the stack to get to that universal voice UI we're imagining.
00:04:54.960 | Yeah, and secondly, as this technology matures, and we've already seen some of this happening,
00:05:01.200 | the capabilities tend to move down the stack.
00:05:05.600 | So what might happen is in your one-off individual applications, you might write some code to solve a specifically difficult challenge.
00:05:13.360 | Now, if enough people experience that challenge, then that tends to get built into the orchestration libraries and frameworks and then eventually make its way into the APIs.
00:05:26.480 | But independently of all of that, the models themselves are getting more and more generally capable.
00:05:33.680 | I mean, we just talked about semantic voice activity detection in the previous talk.
00:05:38.000 | Yeah, this is like a great follow-on to Tom's talk about turn detection because I think turn detection is a perfect example of this.
00:05:46.400 | So, like, I built some of the first talk-to-an-LLM voice AI applications a little over two years ago now.
00:05:52.400 | And I tried to solve turn detection right there in the application code because there weren't any tools yet for it.
00:05:58.640 | A few months later, we built what we thought were, at the time, pretty generalized, state-of-the-art turn detection implementations into Pipecat.
00:06:06.240 | So it moved down a layer, into the framework.
00:06:08.320 | Now, Shrestha has turn detection in the Multimodal Live API, sort of inside the surface area of those same APIs that are doing inference and other things for you.
00:06:18.000 | And I think all of us, as Tom said, expect the models over time to just do turn detection for us.
00:06:23.960 | And for all those hard things, it varies depending on exactly what you're talking about from that long list we put together on that slide.
00:06:32.120 | But in general, I think everything is moving down the stack.
00:06:35.800 | And then more and more interesting use cases are creating more things to put sort of at the top of the stack.
00:06:40.200 | Yeah, I will say we have server-side turn detection built in, but we also allow you to turn off turn detection and use turn-detection models like Daily's and LiveKit's.
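For reference, here is a hedged sketch of what that choice looks like with the google-genai Python SDK: the Live API's server-side voice activity detection is on by default, and you can disable it and signal turns from the client instead, for example from a separate turn-detection model. Field names reflect the SDK at the time of writing; verify against the current docs.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Default behavior: the Live API's server-side VAD decides when the user's
# turn starts and ends.
server_vad = types.LiveConnectConfig(response_modalities=["AUDIO"])

# Alternative: disable automatic detection; the client is then responsible
# for sending explicit activity-start / activity-end signals around each
# user utterance, driven by its own turn-detection logic.
client_vad = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(disabled=True)
    ),
)
```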
00:06:53.320 | So should we start with the demo?
00:06:56.720 | Yeah, we can.
00:06:57.720 | We do have a demo to show you.
00:07:00.040 | And it's sort of a demo of some stuff I've been using in my own life every day for the last year or so.
00:07:06.920 | I've been experimenting with talking to my computer and my phone as much as I can to do various things, as you can imagine because I post about it probably too often on social media.
00:07:17.320 | And one of the things that's become really sticky for me is kind of managing my priorities in like a very loose way with voice AI.
00:07:25.320 | These days, the conversations I have with my phone and my computer often tend to be in this sort of funny and interesting and new gray area for me between like a to-do list and brainstorming or kind of trying to work stuff out.
00:07:37.720 | And because I write this code for myself and I'm always messing with it, this is definitely a cobbler's children never wear shoes situation.
00:07:46.040 | Like I don't have unit tests.
00:07:47.480 | I don't have evals for this code.
00:07:49.480 | I'm always using Shrestha's latest models.
00:07:52.360 | I'm always using some bleeding-edge branch of Pipecat.
00:07:55.080 | So what I thought would be fun to do is fire up a version of this code, have Shresta talk to it and see what breaks.
00:08:03.720 | And then maybe we can talk about like where those bugs are and like how I could have done better next time.
00:08:09.080 | Yeah, I will say having played with this, there's quite a lot of variation in how the model behaves.
00:08:16.200 | So we'll see where we get to there.
00:08:18.440 | One thing that I think is interesting is, as I write more and more of this code, I keep relearning the lesson that the way the models drive the whole application cycle is very different from my traditional programmer's view of things.
00:08:33.480 | So half the things I'm sure you encountered when you were playing with this were cases where the model did something that my code didn't expect.
00:08:39.800 | That might have been bad, but it might actually have been good.
00:08:42.600 | And that's a whole sort of adjustment to how we think about these applications.
00:08:45.480 | Yeah, there's definitely pleasant surprises also.
00:08:48.120 | I am ready.
00:08:50.680 | Okay, do I talk?
00:08:54.280 | I think you can just talk and I don't think I can mute it because I don't know how to use the mouse backwards very well.
00:08:59.880 | Okay.
00:09:01.320 | Yeah, you can talk whenever you're ready.
00:09:03.480 | No problem.
00:09:05.400 | I'm here to help whenever you need.
00:09:07.320 | Yep, I can hear you loud and clear.
00:09:09.400 | All right.
00:09:11.800 | Okay.
00:09:13.640 | What can I help you with today?
00:09:16.200 | Can you help me create a grocery list for asparagus pizza?
00:09:21.720 | Got it.
00:09:24.520 | Pizza ingredients are added to your grocery list.
00:09:27.160 | Anything else?
00:09:28.200 | Can you list out the pizza ingredients for asparagus pizza though?
00:09:32.600 | Sure.
00:09:35.640 | I've updated the grocery list to include pizza crust, mozzarella cheese, tomato sauce, and asparagus.
00:09:42.600 | Anything else?
00:09:43.480 | Would you add anything else to the list?
00:09:48.920 | I've added garlic and olive oil to the list.
00:09:54.760 | Does that sound right?
00:09:55.880 | That sounds perfect.
00:09:57.240 | All right.
00:09:57.880 | Now, let's create my reading list.
00:10:01.160 | Great.
00:10:01.480 | Can you create--
00:10:04.360 | Start adding to your reading list.
00:10:05.640 | Okay, great.
00:10:06.360 | Absolutely.
00:10:07.480 | What books would you like to have?
00:10:08.680 | Turn detection, yeah.
00:10:11.080 | Okay, reading list.
00:10:12.360 | Can you add the--
00:10:15.720 | Please add the first book, Dream Count, and can you look up the author?
00:10:19.720 | I've added quick to your reading list.
00:10:21.960 | No, that was not the book I wanted to add.
00:10:24.680 | Please add the first book, Dream Count, and look up the author for me.
00:10:36.840 | Hey, are you there?
00:10:37.560 | Hey, are you there?
00:10:39.800 | Segmentation fault.
00:10:41.080 | Got it.
00:10:44.440 | Segmentation fault is added to your reading list.
00:10:46.920 | What's next?
00:10:48.040 | I guess I'll have to--
00:10:52.440 | It clearly couldn't find the author for that one.
00:10:58.120 | But let's try one last time.
00:11:01.320 | Okay, one more book, Dream Count, and can you look up the author?
00:11:05.400 | Hey, are you there?
00:11:10.520 | Segmentation.
00:11:14.200 | Okay, segmentation has been added to your reading.
00:11:20.120 | It just likes your voice better than mine.
00:11:24.120 | All right, no.
00:11:26.680 | Segmentation fault is great, but I would really like to read Dream Count.
00:11:32.440 | Can you add Dream Count?
00:11:39.240 | All right, hey, are you still there?
00:11:41.080 | Yes, I'm still here.
00:11:44.760 | How can I help you?
00:11:46.360 | Okay, can you add the second and third books of the three-body problem trilogy?
00:11:53.320 | Sure, the second and third-- both segmentation fault and--
00:11:56.920 | While segmentation--
00:11:58.520 | Can you add the second and third books of the three-body--
00:12:01.960 | This will also be a computer science concept.
00:12:03.320 | Is there anything else I can help you with?
00:12:05.320 | I swear it worked much better 10 times last night.
00:12:11.640 | I apologize if my performance isn't up to par today.
00:12:15.720 | Is there anything specific you'd like me to try again?
00:12:18.520 | Well, maybe you can try-- all right, but let's recover here.
00:12:26.600 | Please add the second and third books of the three-body problem trilogy.
00:12:31.560 | You want to add the second and third books of the three-body problem trilogy.
00:12:39.480 | I've added--
00:12:40.600 | I just want to point out that it did look up the names and find it.
00:12:44.440 | Let's see if it'll cooperate.
00:12:47.800 | This one might be in its training data because it's a bit older,
00:12:54.200 | but the Dream Count one, which it was spectacular on last night,
00:12:58.680 | it did a Google search because it's a new book.
00:13:00.840 | All right, let's move on.
00:13:03.400 | Can you create a new list for me called work tasks?
00:13:07.400 | So first item, I would like to say, create H2 roadmap by end of day Friday.
00:13:15.800 | Got it.
00:13:18.280 | Create H2 roadmap by end of day Friday is added to your work list.
00:13:22.280 | I don't see it.
00:13:23.160 | Can you show me on the screen?
00:13:24.280 | I've displayed the work list for you.
00:13:28.120 | All right.
00:13:28.760 | You should see it.
00:13:29.000 | Let's add another item.
00:13:31.160 | Okay, what's the next item?
00:13:34.280 | Finish writing podcast script by end of day Thursday.
00:13:37.560 | And can you add the exact date?
00:13:39.000 | Okay, I've added finish writing podcast script by end of day Thursday to your work list.
00:13:46.920 | What's the exact date for end of day Thursday?
00:13:49.480 | Today is Wednesday, June 4th, 2025.
00:13:54.760 | So end of day Thursday is June 5th, 2025.
00:13:58.680 | Can you add that to the list?
00:14:00.120 | Yes, I've added the exact date to the work list.
00:14:05.080 | All right.
00:14:05.720 | It now says finish writing podcast script by end of day Thursday, June 5th, 2025.
00:14:10.760 | Is there anything else you'd like to add to the list?
00:14:13.160 | Yes, actually.
00:14:13.800 | Or anything else?
00:14:14.840 | Yes, I would.
00:14:15.320 | Can you combine my grocery list, my reading list, and my work list and split them into two lists,
00:14:24.360 | personal tasks and work tasks?
00:14:26.440 | I've combined your grocery list, reading list, and work list.
00:14:32.920 | They are now displayed on the screen as separate lists.
00:14:35.560 | Why are garlic and olive oil separate from pizza ingredients?
00:14:41.400 | Can you flesh out the grocery list, please?
00:14:44.120 | You're right.
00:14:47.800 | I've updated the grocery list to include garlic and olive oil as part of the pizza ingredients.
00:14:52.280 | All right.
00:14:52.760 | And can you consolidate the work list?
00:14:55.000 | I see two on the screen.
00:14:56.120 | You're right.
00:14:59.880 | I apologize for the duplication.
00:15:01.880 | I've consolidated the work list to a single entry.
00:15:04.760 | You should now see the updated list on the screen.
00:15:06.760 | Okay.
00:15:08.200 | And then finally, can you do the following?
00:15:11.560 | Can you assign grocery list to Quinn, reading list to Shrestha,
00:15:17.800 | with the exception of segmentation fault, which Quinn will read,
00:15:22.040 | and work list to Shrestha.
00:15:23.720 | I understand you want to assign the grocery list to Quinn,
00:15:28.040 | the reading list to Shrestha, except for segmentation fault, which Quinn will read,
00:15:32.680 | and the work list to Shrestha as well.
00:15:34.520 | Can you show that on the screen?
00:15:35.960 | I've updated the lists to show the assignments you specified.
00:15:42.520 | You should now see the updated lists on the screen.
00:15:44.600 | And you've spelt my name correctly, which was the harder name, but you spelled Quinn incorrectly.
00:15:50.920 | But okay, we're out of time here.
00:15:55.240 | Okay.
00:15:55.720 | So if we want, the final thing here is we would like to have a nice background for this app.
00:16:02.200 | So can you--
00:16:03.880 | I apologize for spelling your name.
00:16:05.320 | Okay.
00:16:06.280 | Could you help us generate-- actually, I'll be more assertive.
00:16:10.920 | Please generate an app that has the words "hello world" jumping up and down on the screen,
00:16:19.720 | with the colors of Google.
00:16:21.640 | And then we also want two neon green ASCII cats, which are animated.
00:16:27.720 | I'm now generating an app with "hello world" jumping up and down in Google colors,
00:16:36.600 | and two animated neon green ASCII cats.
00:16:39.480 | It may take a moment for the app to appear on your screen.
00:16:41.640 | Detention.
00:17:01.400 | So yeah, there you have the Google colors and the cats from Pipecat.
00:17:11.880 | So with that, I'll hand it over to Quinn.
00:17:14.680 | So first, you should go back to your pride of place right in the middle.
00:17:19.400 | Thank you for being such a good sport.
00:17:22.600 | Very messy code on my part, including basically no instructions to the LLM about
00:17:28.120 | how to display text on the screen: just telling it it has a function that can display text on the
00:17:33.000 | screen, and it sort of guesses and learns in context, as you could tell from Shrestha's session, about when it should
00:17:38.360 | clear the screen, because there's an optional clear argument to the add-text-to-the-screen function.
00:17:43.720 | And it's super impressive, but also a super jagged frontier in terms of whether it can intuit what
00:17:49.560 | you want to do in those contexts. So thank you for doing this, because this is what I do all the
00:17:54.840 | time with this code: trying to figure out what these models can do, what kind of code you have
00:17:59.080 | to write, and what you don't have to scaffold for them to do well.
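A hypothetical sketch of the kind of tool declaration Quinn is describing, using the google-genai SDK's function-calling types. The function name and parameters are invented for illustration; the demo's actual code isn't published in this transcript.

```python
from google.genai import types

# The model is told only that this function exists; when to pass clear=True
# is something it has to infer in context, as described above.
add_text_to_screen = types.FunctionDeclaration(
    name="add_text_to_screen",
    description="Display text on the user's screen.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "text": types.Schema(
                type=types.Type.STRING, description="Text to display."
            ),
            "clear": types.Schema(
                type=types.Type.BOOLEAN,
                description="If true, clear the screen before displaying the text.",
            ),
        },
        required=["text"],
    ),
)

# Hand the tool to a Live API session alongside the rest of the config.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[types.Tool(function_declarations=[add_text_to_screen])],
)
```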
00:18:01.960 | Yeah. And it's been, you know, playing with this, every turn is different. And it's interesting to see
00:18:08.600 | the things that it struggles with, like your name, even if I spell out the exact letters, it somehow really
00:18:16.680 | wants to spell Quinn the way it spells it. I think also, with turn detection, as we saw,
00:18:23.320 | there's a lot of work that can be done there, of course. And there's,
00:18:29.480 | of course, a lot of variation. Sometimes it gets the grocery list
00:18:34.120 | perfect and, you know, combines the lists perfectly. And sometimes it's a bit in the middle, like here.
00:18:38.920 | And the way this code works is, for a given session, it just loads lots and
00:18:44.200 | lots of previous conversational sessions as user/assistant messages, and
00:18:49.000 | sometimes, depending on the version of the code I've got running, it summarizes a little bit;
00:18:53.000 | sometimes it doesn't. So we really are leaning on the intelligence of the LLM to do all of the
00:18:59.160 | contextual understanding about what we mean by a list, and the context in which we are
00:19:05.080 | talking about that list. It is super amazing that it works at all, basically, in my mind.
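A minimal sketch of the context strategy being described: load prior sessions as alternating user/assistant messages and optionally compress the older ones into a single summary turn. Everything here, including the summarize() stand-in, is hypothetical illustration rather than the demo's actual code.

```python
from typing import TypedDict

class Message(TypedDict):
    role: str      # "user" or "assistant"
    content: str

def summarize(sessions: list[list[Message]]) -> str:
    # Stand-in for an LLM summarization call over the older sessions.
    n = sum(len(s) for s in sessions)
    return f"{n} earlier messages about the user's lists and priorities."

def build_context(prior_sessions: list[list[Message]],
                  keep_recent: int = 3) -> list[Message]:
    """Concatenate prior sessions into one message history, replacing all
    but the most recent few with a single summary message."""
    older = prior_sessions[:-keep_recent]
    recent = prior_sessions[-keep_recent:]

    context: list[Message] = []
    if older:
        context.append({
            "role": "user",
            "content": "Summary of earlier sessions: " + summarize(older),
        })
    for session in recent:
        context.extend(session)
    return context
```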
00:19:10.120 | And it's all voice driven, and it's all multimodal from the ground up. We have a whole other video
00:19:14.680 | we can show, but I definitely think we're out of time. So we will.
00:19:18.280 | You have the final talk, and everyone seems to be excited.
00:19:21.560 | So maybe should we talk about our grandmothers?
00:19:24.760 | Oh, yes, I totally forgot that part. Sorry, let's skip past the demo where it gets the grocery list perfect.
00:19:30.120 | I think maybe this crowd would like to see that demo.
00:19:32.920 | No, that was great. So this has been fun for me to work on because like it's so relevant to my everyday life.
00:19:39.240 | But in Shrestha and I were talking about it, and I think there's actually something else that really kind of hooked me that she said.
00:19:45.320 | Yeah, so you know, my grandmother was Indian, of course, and she used to wear this cloth garment called a sari.
00:19:53.000 | And her way of reminding herself when she had to do things was tying knots on the sari, of course.
00:19:59.880 | And then I was chatting with Quinn. And what was incredible is apparently his grandmother in North Carolina,
00:20:06.120 | so very different from Calcutta in India, used to tie strings around her fingers.
00:20:13.320 | Firstly, you know, this is kind of incredible. You know, no matter how many continents separate us, like smart people come up with the same generally intelligent patterns.
00:20:21.720 | But it's also incredible how technology allows humans to evolve.
00:20:27.560 | Now, the one problem with either the knots or the strings is you knew you had to remember something, but you didn't know what it was.
00:20:33.640 | So you still relied on your memory. And, you know, ultimately, that's why I do the work I do at Google, because I want to build the technologies
00:20:43.800 | that enable, you know, an infinite world of creative possibilities tomorrow, or even today, across continents.
00:20:53.560 | And I just want to say that we believe that voice is the most natural of interfaces.
00:21:00.520 | And there will come a world where most of the interaction with language models will happen via voice.
00:21:06.120 | And the Gemini models are trained to be multimodal from the ground up.
00:21:11.720 | So, of course, they ingest text, voice, but also images and video.
00:21:16.120 | So if you have any questions about Gemini, please reach out to me on X, on LinkedIn, email, wherever.
00:21:24.840 | Happy to work with builders like yourself.
00:21:26.840 | Yeah, thanks for coming to the talk. And we would love to see what you build with these models and APIs.