Milliseconds to Magic: Real‑Time Workflows using the Gemini Live API and Pipecat

00:00:20.400 |
Humans are storytellers, talkers, listeners, conversationalists, 00:00:40.840 |
and we use sound to understand the world around us. 00:00:46.460 |
We've been working together for the past few months, 00:00:48.960 |
Shrestha from the angle of models and APIs, 00:00:52.140 |
and me from the application layer and agent framework direction. 00:00:57.180 |
We both believe that voice is a critical and universal building block for the whole next generation 00:01:00.940 |
of Gen AI, especially at the UI level, but more generally as well. 00:01:04.700 |
Those of us who are early adopters of personal voice AI talk to our computers all the time. 00:01:09.460 |
We think of the LLMs we talk to as sounding boards and coaches 00:01:12.960 |
and interfaces to everything that lives on our devices and in the cloud. 00:01:17.880 |
And this is not just an early adopter phenomenon. 00:01:21.640 |
Like, we already have voice agents deployed at scale. 00:01:26.340 |
Language translation apps that translate between a patient and a doctor. 00:01:30.880 |
Directed learning apps that a fourth grader can use to learn a topic they want to. 00:01:36.320 |
Speech therapy apps and co-pilots that help people navigate complex enterprise software. 00:01:43.080 |
One of the things we see in our work with customers at Daily is that it's pretty common for people not to realize they're talking to a voice agent on a phone call, 00:01:52.080 |
even when you tell them at the beginning of the phone call that they're talking to an AI. 00:01:57.080 |
Yeah, and kids born today will probably take all of this for granted. 00:02:00.920 |
But those of us who are living through this evolution of talking computers, this can sometimes feel like magic. 00:02:08.040 |
But of course, anybody who's seen a really great magician prepare a magic trick knows that the magic is just the interface. 00:02:17.640 |
There's a lot of hard work that goes into creating that magic trick. 00:02:24.760 |
So here's a partial list of the hard things that, done right, collectively add up to that magic. 00:02:29.160 |
So real-time responsiveness, which everyone in this track has talked about all day as the foundational thing you have to get right or voice AI is unworkable, 00:02:37.000 |
through to the things that we're just starting to experiment with, like generating dynamic user interface elements for every conversational turn. 00:02:45.400 |
These are the things we've been hacking on and thinking about together for the past few months. 00:02:49.600 |
And we're not going to go over all of these today, although we did have a little extra time in the session, right, Thor? 00:02:54.400 |
Thor said we could talk for like a couple of hours, maybe, but we do have a framework that we thought would be useful to share with you, 00:03:02.000 |
a framework that sort of maps onto how we've worked together from the model layer all the way up. 00:03:06.960 |
Yeah, and this barely scratches the surface, but here are the layers of the voice AI stack. 00:03:12.880 |
So at the bottom, underpinning everything, you have the large language models that frontier labs like DeepMind work on. 00:03:20.480 |
Then above that, you have carefully designed, but at this stage, constantly evolving, real-time APIs. 00:03:28.080 |
Google's version is called the Gemini Live API. 00:03:31.520 |
Above the APIs are the orchestration libraries and frameworks, like Pipecat, that help to manage and abstract the complexity of building these real-time multimodal applications. 00:03:46.080 |
And then, of course, at the top of the stack, you have the application code. 00:03:51.040 |
For each of the hard things we listed on the previous slide, the code that implements that hard thing lives somewhere in that stack. 00:03:57.680 |
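To make the layers concrete, here's a minimal sketch of what the orchestration layer looks like in Pipecat: a transport feeding user audio into a Gemini Live service and back out. This is illustrative only; class names like GeminiMultimodalLiveLLMService and DailyTransport, their module paths, and their constructor parameters vary across Pipecat releases, so treat the exact imports and arguments as assumptions.

```python
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    # Transport layer: moves real-time audio between the user and the pipeline.
    transport = DailyTransport(
        room_url=os.environ["DAILY_ROOM_URL"],  # assumed env var for the demo room
        token=None,
        bot_name="voice-agent",
        params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Speech-to-speech inference via the Gemini Live API.
    llm = GeminiMultimodalLiveLLMService(api_key=os.environ["GOOGLE_API_KEY"])

    # Wire user audio -> model -> user audio; the framework handles the
    # real-time plumbing (buffering, interruptions, frame routing).
    pipeline = Pipeline([transport.input(), llm, transport.output()])
    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```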
So one of the ways we think about this is that there's a map, and you can sort of think about it two-dimensionally, maybe. 00:04:03.760 |
There's the question of where the code lives that solves the hard problem you're thinking about as a voice agent developer. 00:04:12.000 |
And then how mature is our solution to that right now? 00:04:19.520 |
And what we've tried to do here is map all of these various things that you need to get right on a right-to-left axis of maturity. 00:04:28.480 |
And there are a couple of things that are kind of top of mind for me about this mapping. 00:04:32.320 |
One is that I don't think of any of these things as more than about 50% solved. 00:04:38.960 |
Shrestha and I just argued about it a little bit. 00:04:40.560 |
Like what's the right way to represent that on this slide? 00:04:43.360 |
But what we're trying to say is basically it's early. 00:04:46.960 |
And there's a lot of work to do at every part of the stack to get to that universal voice UI we're imagining. 00:04:54.960 |
Yeah, and secondly, as this technology matures, and we've already seen some of this happening, 00:05:01.200 |
the capabilities tend to move down the stack. 00:05:05.600 |
So what might happen is in your one-off individual applications, you might write some code to solve a specifically difficult challenge. 00:05:13.360 |
Now, if enough people experience that challenge, then that tends to get built into the orchestration libraries and frameworks and then eventually make its way into the APIs. 00:05:26.480 |
But independently of all of that, the models themselves are getting more and more generally capable. 00:05:33.680 |
I mean, we just talked about semantic voice activity detection in the previous talk. 00:05:38.000 |
Yeah, this is like a great follow-on to Tom's talk about turn detection because I think turn detection is a perfect example of this. 00:05:46.400 |
So I built some of the first talk-to-an-LLM voice AI applications a little over two years ago now. 00:05:52.400 |
And I tried to solve turn detection right there in the application code because there weren't any tools yet for it. 00:05:58.640 |
A few months later, we built what we thought were pretty generalized, at the time state-of-the-art, turn detection implementations into Pipecat. 00:06:08.320 |
Now, Shrestha has turn detection in the Multimodal Live API, sort of inside the surface area of those same APIs that are doing inference and other things for you. 00:06:18.000 |
And I think all of us, as Tom said, expect the models over time to just do turn detection for us. 00:06:23.960 |
And for all those hard things, it varies depending on exactly which item you're talking about from that long list we put together on that slide. 00:06:32.120 |
But in general, I think everything is moving down the stack. 00:06:35.800 |
And then more and more interesting use cases are creating more things to put sort of at the top of the stack. 00:06:40.200 |
Yeah, I will say we have server-side turn detection built in, but we also allow you to turn off turn detection and use turn-detection models like the ones from Daily and LiveKit. 00:07:00.040 |
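As a rough sketch of what that choice looks like at the API layer, using the google-genai Python SDK: you can leave the Live API's automatic (server-side) voice activity detection on, or disable it and send explicit activity signals from your own turn-detection model. The field and model names here reflect the SDK as of mid-2025 and have shifted between versions, so treat them as assumptions.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Disable server-side automatic activity detection so a client-side
# turn-detection model can decide when the user's turn starts and ends.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(disabled=True)
    ),
)


async def run():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",  # example Live model name
        config=config,
    ) as session:
        # With automatic detection off, bracket the user's speech manually.
        await session.send_realtime_input(activity_start=types.ActivityStart())
        # ... stream microphone audio via session.send_realtime_input(audio=...) ...
        await session.send_realtime_input(activity_end=types.ActivityEnd())
```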
So this is a demo of some stuff I've been using in my own life every day for the last year or so. 00:07:06.920 |
I've been experimenting with talking to my computer and my phone as much as I can to do various things, as you can imagine because I post about it probably too often on social media. 00:07:17.320 |
And one of the things that's become really sticky for me is kind of managing my priorities in like a very loose way with voice AI. 00:07:25.320 |
These days, the conversations I have with my phone and my computer often tend to be in this sort of funny and interesting and new gray area for me between like a to-do list and brainstorming or kind of trying to work stuff out. 00:07:37.720 |
And because I write this code for myself and I'm always messing with it, this is definitely a cobbler's children never wear shoes situation. 00:07:49.480 |
I'm always using Shrestha's latest models. 00:07:52.360 |
I'm always using some bleeding-edge branch of Pipecat. 00:07:55.080 |
So what I thought would be fun to do is fire up a version of this code, have Shrestha talk to it, and see what breaks. 00:08:03.720 |
And then maybe we can talk about like where those bugs are and like how I could have done better next time. 00:08:09.080 |
Yeah, I will say having played with this, there's quite a lot of variation in how the model behaves. 00:08:18.440 |
One thing that I think is interesting is, as I write more and more of this code, I keep relearning the lesson that the way the models drive the whole application cycle is very different from my traditional programmer's view of things. 00:08:33.480 |
So half the things I'm sure you encountered when you were playing with this are cases where the model did something that my code didn't expect. 00:08:39.800 |
That might have been bad, but it might actually have been good. 00:08:42.600 |
And that's a whole sort of adjustment to how we think about these applications. 00:08:45.480 |
Yeah, there's definitely pleasant surprises also. 00:08:54.280 |
I think you can just talk and I don't think I can mute it because I don't know how to use the mouse backwards very well. 00:09:16.200 |
Can you help me create a grocery list for asparagus pizza? 00:09:24.520 |
Pizza ingredients are added to your grocery list. 00:09:28.200 |
Can you list out the pizza ingredients for asparagus pizza though? 00:09:35.640 |
I've updated the grocery list to include pizza crust, mozzarella cheese, tomato sauce, and asparagus. 00:10:15.720 |
Please add the first book, Dream Count, and can you look up the author? 00:10:24.680 |
Please add the first book, Dream Count, and look up the author for me. 00:10:44.440 |
Segmentation fault is added to your reading list. 00:10:52.440 |
It clearly couldn't find the author for that one. 00:11:01.320 |
Okay, one more book, Dream Count, and can you look up the author? 00:11:14.200 |
Okay, Segmentation fault has been added to your reading list. 00:11:26.680 |
Segmentation fault is great, but I would really like to read Dream Count. 00:11:46.360 |
Okay, can you add the second and third books of the three-body problem trilogy? 00:11:53.320 |
Sure, the second and third-- both segmentation fault and-- 00:11:58.520 |
Can you add the second and third books of the three-body-- 00:12:01.960 |
This will also be a computer science concept. 00:12:05.320 |
I swear it worked much better 10 times last night. 00:12:11.640 |
I apologize that my performance is not up to par today. 00:12:15.720 |
Is there anything specific you'd like me to try again? 00:12:18.520 |
Well, maybe you can try-- all right, but let's recover here. 00:12:26.600 |
Please add the second and third books of the three-body problem trilogy. 00:12:31.560 |
You want to add the second and third books of the three-body problem trilogy. 00:12:40.600 |
I just want to point out that it did look up the names and find it. 00:12:47.800 |
This one might be in its training data because it's a bit older, 00:12:54.200 |
but the Dream Count one, which it was spectacular on last night, 00:12:58.680 |
it did a Google search because it's a new book. 00:13:03.400 |
Can you create a new list for me called work tasks? 00:13:07.400 |
So first item, I would like to say, create H2 roadmap by end of day Friday. 00:13:18.280 |
Create H2 roadmap by end of day Friday is added to your work list. 00:13:34.280 |
Finish writing podcast script by end of day Thursday. 00:13:39.000 |
Okay, I've added finish writing podcast script by end of day Thursday to your work list. 00:13:46.920 |
What's the exact date for end of day Thursday? 00:14:00.120 |
Yes, I've added the exact date to the work list. 00:14:05.720 |
It now says finish writing podcast script by end of day Thursday, June 5th, 2025. 00:14:10.760 |
Is there anything else you'd like to add to the list? 00:14:15.320 |
Can you combine my grocery list, my reading list, and my work list and split them into two lists, 00:14:26.440 |
I've combined your grocery list, reading list, and work list. 00:14:32.920 |
They are now displayed on the screen as separate lists. 00:14:35.560 |
Why are garlic and olive oil separate from pizza ingredients? 00:14:47.800 |
I've updated the grocery list to include garlic and olive oil as part of the pizza ingredients. 00:15:01.880 |
I've consolidated the work list to a single entry. 00:15:04.760 |
You should now see the updated list on the screen. 00:15:11.560 |
Can you assign grocery list to Quinn, reading list to Shrestha, 00:15:17.800 |
with the exception of segmentation fault, which Quinn will read, 00:15:23.720 |
I understand you want to assign the grocery list to Quinn, 00:15:28.040 |
the reading list to Shrestha, except for segmentation fault, which Quinn will read, 00:15:35.960 |
I've updated the lists to show the assignments you specified. 00:15:42.520 |
You should now see the updated lists on the screen. 00:15:44.600 |
And you've spelled my name correctly, which was the harder name, but you spelled Quinn incorrectly. 00:15:55.720 |
So if we want, the final thing here is we would like to have a nice background for this app. 00:16:06.280 |
Could you help us generate-- actually, I'll be more assertive. 00:16:10.920 |
Please generate an app that has the words "hello world" jumping up and down on the screen, 00:16:21.640 |
And then we also want two neon green ASCII cats, which are animated. 00:16:27.720 |
I'm now generating an app with "hello world" jumping up and down in Google colors, 00:16:39.480 |
It may take a moment for the app to appear on your screen. 00:17:01.400 |
So yeah, there you have the Google colors and the cats from PipeCat. 00:17:14.680 |
So first, you should go back to your pride of place right in the middle. 00:17:22.600 |
Very messy code on my part, including things like basically no instructions to the LLM about 00:17:28.120 |
how to display text on the screen -- just telling it that it has a function that can display text on the 00:17:33.000 |
screen. It sort of guesses and learns in context, as you could tell, Shrestha, about when it should 00:17:38.360 |
clear the screen, because there's an optional clear argument to the add-text-to-the-screen function. 00:17:43.720 |
And it's super impressive, but also a super jagged frontier as to whether it can intuit what 00:17:49.560 |
you want to do in those contexts. So thank you for doing this, because this is what I do all the 00:17:54.840 |
time with this code: trying to figure out what these models can do, what kind of code you have 00:17:59.080 |
to write, and what you don't have to scaffold for them to do well. 00:18:01.960 |
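The talk doesn't show the actual code, so here is a hypothetical reconstruction of what that kind of tool might look like: a function declaration with an optional clear argument, plus a stub handler. The name, schema, and Screen class are illustrative stand-ins, not Quinn's real app.

```python
# Hypothetical reconstruction of the demo's screen tool. The talk only says
# there is a function that displays text with an optional "clear" argument;
# everything below is illustrative.
add_text_to_screen_decl = {
    "name": "add_text_to_screen",
    "description": "Display text on the user's screen.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "The text to display."},
            "clear": {
                "type": "boolean",
                "description": "If true, clear the screen before displaying.",
            },
        },
        "required": ["text"],
    },
}


class Screen:
    """Stand-in for the demo's real UI; prints instead of rendering."""

    def clear(self):
        print("\033c", end="")  # ANSI clear-screen escape

    def add_text(self, text: str):
        print(text)


screen = Screen()


async def add_text_to_screen(text: str, clear: bool = False):
    # The model decides when to pass clear=True; the prompt gives it no
    # display instructions, so it learns the convention in context.
    if clear:
        screen.clear()
    screen.add_text(text)
```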
Yeah. And playing with this, every turn is different. And it's interesting to see 00:18:08.600 |
the things that it struggles with, like your name: even if I spell out the exact letters, it somehow really 00:18:16.680 |
wants to spell Quinn the way it wants to. I think also, with turn detection, as we saw, 00:18:23.320 |
there's a lot of work that can be done there, of course. And, as I'm trying to remember, there's, 00:18:29.480 |
of course, a lot of variation. There are times when it gets the grocery list 00:18:34.120 |
perfect and, you know, combines the lists perfectly, and sometimes it's a bit in the middle, like here. 00:18:38.920 |
And the way this code works is, for a given session, it just loads lots and lots and 00:18:44.200 |
lots of previous conversational sessions as user/assistant messages. Sometimes, 00:18:49.000 |
depending on the version of the code I've got running, it summarizes a little bit; 00:18:53.000 |
sometimes it doesn't. So we really are leaning on the intelligence of the LLM to do all of the 00:18:59.160 |
contextual understanding about what we mean by a list, and the context in which we are 00:19:05.080 |
talking about that list. It is super amazing that it works at all, basically, in my mind. 00:19:10.120 |
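A minimal sketch of that context strategy, with hypothetical load_sessions and summarize stand-ins for the app's real storage and summarization code: recent sessions are replayed verbatim as user/assistant messages, and older ones are optionally collapsed into summaries.

```python
# Sketch of the context-loading strategy described above. load_sessions()
# and summarize() are hypothetical helpers, not the demo's actual code.
def build_context(load_sessions, summarize, keep_verbatim: int = 5):
    """Return a message list: older sessions summarized, recent ones verbatim."""
    messages = []
    sessions = load_sessions()  # list of sessions, oldest first; each session
                                # is a list of {"role": ..., "content": ...} dicts
    for i, session in enumerate(sessions):
        if len(sessions) - i > keep_verbatim:
            # Collapse an older session into a single summary message.
            messages.append({
                "role": "user",
                "content": f"(Summary of an earlier conversation: {summarize(session)})",
            })
        else:
            # Recent sessions go in verbatim, user/assistant alternating,
            # leaning on the model to infer what "the list" refers to.
            messages.extend(session)
    return messages
```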
And it's all voice-driven, and it's all multimodal from the ground up. We have a whole other video 00:19:14.680 |
we can show, but I definitely think we're out of time. So we will-- 00:19:18.280 |
You have the final talk, and everyone seems to be excited. 00:19:21.560 |
So maybe should we talk about our grandmothers? 00:19:24.760 |
Oh, yes, I totally forgot that part. Sorry, let's skip past the demo where it gets the grocery list perfect. 00:19:30.120 |
I think maybe this crowd would like to see that demo. 00:19:32.920 |
No, that was great. So this has been fun for me to work on because like it's so relevant to my everyday life. 00:19:39.240 |
But when Shrestha and I were talking about it, I think there's actually something else that really hooked me, something she said. 00:19:45.320 |
Yeah, so you know, my grandmother was Indian, of course, and she used to wear this cloth garment called a sari. 00:19:53.000 |
And her way of reminding herself when she had to do things was tying knots on the sari, of course. 00:19:59.880 |
And then I was chatting with Quinn. And what was incredible is apparently his grandmother in North Carolina. 00:20:06.120 |
So very different from Calcutta in India used to tie strings around her fingers. 00:20:13.320 |
Firstly, you know, this is kind of incredible. You know, no matter how many continents separate us, like smart people come up with the same generally intelligent patterns. 00:20:21.720 |
But it's also incredible how technology allows humans to evolve. 00:20:27.560 |
Now, the one problem with either the knots or the strings is you knew you had to remember something, but you didn't know what it was. 00:20:33.640 |
So you still relied on your memory. And, you know, ultimately, that's why I do the work I do at Google, because I want to build the technologies 00:20:43.800 |
that enable, you know, an infinite world of creative possibilities tomorrow, or even today, across continents. 00:20:53.560 |
And I just want to say that we believe that voice is the most natural of interfaces. 00:21:00.520 |
And there will come a world where most of the interaction with language models will happen via voice. 00:21:06.120 |
And the Gemini models are trained to be multimodal from the ground up. 00:21:11.720 |
So, of course, they ingest text, voice, but also images and video. 00:21:16.120 |
So if you have any questions about Gemini, please reach out to me on X, on LinkedIn, email, wherever. 00:21:26.840 |
Yeah, thanks for coming to the talk. And we would love to see what you build with these models and APIs.