Thao Yeager: So voice is the most natural of interfaces. Humans are storytellers, talkers, listeners, conversationalists. We think out loud. We learn to talk before we learn to read. And most of us talk faster than we type. We express emotion through our voices, and we use sound to understand the world around us.
We've been working together for the past few months, Shrestha from the angle of models and APIs, and me from the application layer and agent framework direction. And I think we both believe that voice is a critical and universal building block for the whole next generation of Gen AI, especially at the UI level, but more generally as well.
Those of us who are early adopters of personal voice AI talk to our computers all the time. We think of the LLMs we talk to as sounding boards and coaches and interfaces to everything that lives on our devices and in the cloud. And this is not just an early adopter phenomenon.
Like, we already have voice agents deployed at scale. Language translation apps that translate between a patient and a doctor. Directed learning apps that a fourth grader can use to learn a topic they want to. Speech therapy apps and co-pilots that help people navigate complex enterprise software. One of the things we see in our work with customers at Daily is it's pretty common for people not to realize that they're talking to a voice agent on a phone call, even when you tell them at the beginning of the phone call that they're talking to an AI.
Yeah, and kids born today will probably take all of this for granted. But for those of us who are living through this evolution of talking computers, it can sometimes feel like magic. But of course, anybody who's seen a really great magician prepare a magic trick knows that the magic is just the interface.
There's a lot of hard work that goes into creating that magic trick. So here's a partial list of the hard things that, done right, collectively add up to that magic. It starts with real-time responsiveness, which everyone in this track has talked about all day as the foundational thing you have to get right, or voice AI is unworkable. And it runs all the way through to the things we're just starting to experiment with, like generating dynamic user interface elements for every conversational turn. These are the things we've been hacking on and thinking about together for the past few months. And we're not going to go over all of these today, although we did have a little extra time in the session, right, Thor?
Thor said we could talk for like a couple of hours, maybe, but we do have a framework that we thought would be useful to share with you, a framework that sort of maps onto how we've worked together from the model layer all the way up. Yeah, and this barely scratches the surface, but here are the layers of the voice AI stack.
So at the bottom, underpinning everything, you have the large language models that frontier labs like DeepMind work on. Then above that, you have carefully designed, but at this stage, constantly evolving, real-time APIs. Google's version is called the Gemini Live API. Above the APIs are the orchestration libraries and frameworks, like PipeCat, that help to manage and abstract the complexity of building these real-time multimodal applications.
And then, of course, at the top of the stack, you have the application code. For each of the hard things we listed on the previous slide, the code that implements that hard thing lives somewhere in that stack.
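To make those layers concrete, here's a rough sketch of what the framework and application layers can look like with PipeCat talking to the Gemini Live API over a Daily transport. The class names and module paths are our best recollection of recent PipeCat releases and may have shifted, so treat this as an illustration rather than reference code.

```python
# Sketch: a PipeCat pipeline wiring a real-time transport (Daily) to the
# Gemini Live API. Module paths and constructor arguments may differ across
# PipeCat versions -- check the current PipeCat docs/examples before using.
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    # Transport layer: real-time audio in and out of a Daily room.
    transport = DailyTransport(
        os.environ["DAILY_ROOM_URL"],
        None,  # room token, if the room requires one
        "voice-agent",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Model/API layer: speech-to-speech inference via the Gemini Live API.
    llm = GeminiMultimodalLiveLLMService(api_key=os.environ["GOOGLE_API_KEY"])

    # Application layer: your own processors (tool handlers, UI updates,
    # logging) slot into this list alongside the built-in pieces.
    pipeline = Pipeline([transport.input(), llm, transport.output()])

    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```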
So one of the ways we think about this is that there's a map, and you can think about it two-dimensionally. One axis is: where in the stack does the code live that solves the hard problem you're thinking about as a voice agent developer? And the other is: how mature is our solution to that right now? Yeah, basically, how solved is this thing? And what we've tried to do here is map all of these various things that you need to get right on a right-to-left axis of maturity.
And there are a couple of things that are kind of top of mind for me about this mapping. One is that I don't think of any of these things as more than about 50% solved. That's a totally arbitrary, personal estimate; Shrestha and I just argued about it a little bit.
Like what's the right way to represent that on this slide? But what we're trying to say is basically it's early. It's early for voice AI. And there's a lot of work to do at every part of the stack to get to that universal voice UI we're imagining. Yeah, and secondly, as this technology matures, and we've already seen some of this happening, the capabilities tend to move down the stack.
So what might happen is in your one-off individual applications, you might write some code to solve a specifically difficult challenge. Now, if enough people experience that challenge, then that tends to get built into the orchestration libraries and frameworks and then eventually make its way into the APIs. But independently of all of that, the models themselves are getting more and more generally capable.
I mean, we just talked about semantic voice activity detection in the previous talk. Yeah, this is a great follow-on to Tom's talk about turn detection, because I think turn detection is a perfect example of this. So I built some of the first talk-to-an-LLM voice AI applications a little over two years ago now.
And I tried to solve turn detection right there in the application code, because there weren't any tools yet for it. A few months later, we built what we thought were, at the time, pretty generalized, state-of-the-art turn detection implementations into PipeCat. So it moved down a layer, into the framework. Now, Shrestha has turn detection in the Multimodal Live API, sort of inside the surface area of those same APIs that are doing inference and other things for you.
And I think all of us, as Tom said, expect the models over time to just do turn detection for us. As for all those hard things on that long list we put together on that slide, it varies depending on exactly which one you're talking about. But in general, I think everything is moving down the stack.
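To make concrete what solving turn detection at the application layer looked like, here's a minimal sketch of the kind of silence-timeout logic those early apps hand-rolled on top of a VAD signal. The 800 ms threshold and the is_speech input are placeholders, not anything from PipeCat or the Live API.

```python
# A minimal, do-it-yourself turn detector of the kind early voice AI apps
# hand-rolled at the application layer: end the user's turn after a fixed
# stretch of silence. Real systems tune the threshold and, increasingly,
# replace this with semantic / model-based turn detection further down the stack.
import time


class SilenceTimeoutTurnDetector:
    def __init__(self, silence_ms: float = 800.0):
        self.silence_ms = silence_ms
        self._last_speech_time: float | None = None
        self._in_turn = False

    def process_frame(self, is_speech: bool, now: float | None = None) -> bool:
        """Feed one audio frame's VAD result; returns True when a turn ends."""
        now = time.monotonic() if now is None else now
        if is_speech:
            self._in_turn = True
            self._last_speech_time = now
            return False
        if self._in_turn and self._last_speech_time is not None:
            if (now - self._last_speech_time) * 1000 >= self.silence_ms:
                self._in_turn = False
                return True  # silence lasted long enough: treat the turn as finished
        return False
```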
And then more and more interesting use cases are creating more things to put at the top of the stack. Yeah, and I will say we have server-side turn detection built in, but we also allow you to turn off turn detection and use turn detection models from providers like Daily and LiveKit. So should we start with the demo?
Yeah, we can. We do have a demo to show you. And it's sort of a demo of some stuff I've been using in my own life every day for the last year or so. I've been experimenting with talking to my computer and my phone as much as I can to do various things, as you can imagine because I post about it probably too often on social media.
And one of the things that's become really sticky for me is kind of managing my priorities in like a very loose way with voice AI. These days, the conversations I have with my phone and my computer often tend to be in this sort of funny and interesting and new gray area for me between like a to-do list and brainstorming or kind of trying to work stuff out.
And because I write this code for myself and I'm always messing with it, this is definitely a cobbler's-children-never-wear-shoes situation. I don't have unit tests. I don't have evals for this code. I'm always using Shrestha's latest models. I'm always using some bleeding-edge branch of PipeCat.
So what I thought would be fun to do is fire up a version of this code, have Shrestha talk to it, and see what breaks. And then maybe we can talk about where those bugs are and how I could do better next time. Yeah, I will say, having played with this, there's quite a lot of variation in how the model behaves.
So we'll see where we get to there. One thing that I think is interesting is that as I write more and more of this code, I keep relearning the lesson that the way the models drive the whole application cycle is very different from my traditional programmer's view of things. So half the things I'm sure you encountered when you were playing with this were cases where the model did something that my code didn't expect.
That might have been bad, but it might actually have been good. And that's a whole adjustment to how we think about these applications. Yeah, there are definitely pleasant surprises also. I am ready. Okay, do I talk? I think you can just talk, and I don't think I can mute it because I don't know how to use the mouse backwards very well.
Okay. Yeah, you can talk whenever you're ready. No problem. I'm here to help whenever you need. Yep, I can hear you loud and clear. All right. Okay. What can I help you with today? Yes. Can you help me create a grocery list for asparagus pizza? Got it. Pizza ingredients are added to your grocery list.
Anything else? Can you list out the pizza ingredients for asparagus pizza though? Sure. I've updated the grocery list to include pizza crust, mozzarella cheese, tomato sauce, and asparagus. Anything else? Would you add anything else to the list? I've added garlic and olive oil to the list. Does that sound right?
That sounds perfect. All right. Now, let's create my reading list. So-- Great. Can you create-- Start adding to your reading list. Okay, great. Absolutely. What books would you like to have? Turn detection, yeah. Okay, reading list. Can you add the-- Please add the first book, Dream Count, and can you look up the author?
I've added quick to your reading list. No, that was not the book I wanted to add. Please add the first book, Dream Count, and look up the author for me. Hey, are you there? Hey, are you there? Segmentation fault. Got it. Segmentation fault is added to your reading list.
What's next? I guess I'll have to-- It clearly couldn't find the author for that one. But let's try one last time. Okay, one more book, Dream Count, and can you look up the author? Hey, are you there? Segmentation. Okay, segmentation has been added to your reading. It just likes your voice better than mine.
All right, no. Segmentation fault is great, but I would really like to read Dream Count. Can you add Dream Count? All right, hey, are you still there? Yes, I'm still here. How can I help you? Okay, can you add the second and third books of the three-body problem trilogy?
Sure, the second and third-- both segmentation fault and-- While segmentation-- Can you add the second and third books of the three-body-- This will also be a computer science concept. Is there anything else I can help you with? I swear it worked much better 10 times last night. I apologize that my performance is not up to par today.
Is there anything specific you'd like me to try again? Well, maybe you can try-- all right, but let's recover here. Please add the second and third books of the three-body problem trilogy. You want to add the second and third books of the three-body problem trilogy. I've added-- I just want to point out that it did look up the names and find it.
Let's see if it'll cooperate. This one might be in its training data because it's a bit older, but the Dream Count one, which it was spectacular on last night, it did a Google search for because it's a new book. All right, let's move on. Can you create a new list for me called work tasks?
So first item, I would like to say, create H2 roadmap by end of day Friday. Got it. Create H2 roadmap by end of day Friday is added to your work list. I don't see it. Can you show me on the screen? I've displayed the work list for you. All right.
You should see it. Let's add another item. Okay, what's the next item? Finish writing podcast script by end of day Thursday. And can you add the exact date? Okay, I've added finish writing podcast script by end of day Thursday to your work list. What's the exact date for end of day Thursday?
Today is Wednesday, June 4th, 2025. So end of day Thursday is June 5th, 2025. Can you add that to the list? Yes, I've added the exact date to the work list. All right. It now says finish writing podcast script by end of day Thursday, June 5th, 2025. Is there anything else you'd like to add to the list?
Yes, actually. Or anything else? Yes. Yes, I would. Can you combine my grocery list, my reading list, and my work list and split them into two lists, personal tasks and work tasks? I've combined your grocery list, reading list, and work list. They are now displayed on the screen as separate lists.
Why are garlic and olive oil separate from pizza ingredients? Can you flesh out the grocery list, please? You're right. I've updated the grocery list to include garlic and olive oil as part of the pizza ingredients. All right. And can you consolidate the work list? I see two on the screen.
You're right. I apologize for the duplication. I've consolidated the work list to a single entry. You should now see the updated list on the screen. Okay. And then finally, can you do the following? Can you assign grocery list to Quinn, reading list to Shrestha, with the exception of segmentation fault, which Quinn will read, and work list to Shrestha.
I understand you want to assign the grocery list to Quinn, the reading list to Shrestha, except for segmentation fault, which Quinn will read, and the work list to Shrestha as well. Can you show that on the screen? I've updated the lists to show the assignments you specified. You should now see the updated lists on the screen.
And you've spelt my name correctly, which was the harder name, but you spelled Quinn incorrectly. But okay, we're out of time here. Okay. So if we want, the final thing here is we would like to have a nice background for this app. So can you-- I apologize for spelling your name.
Okay. Could you help us generate-- actually, I'll be more assertive. Please generate an app that has the words "hello world" jumping up and down on the screen, with the colors of Google. And then we also want two neon green ASCII cats, which are animated. I'm now generating an app with "hello world" jumping up and down in Google colors, and two animated neon green ASCII cats.
It may take a moment for the app to appear on your screen. So yeah, there you have the Google colors and the cats from PipeCat. So with that, I'll hand it over to Quinn. So first, you should go back to your pride of place right in the middle.
Thank you for being such a good sport. This is very messy code on my part. For example, there are basically no instructions to the LLM about how to display text on the screen; we just tell it that it has a function that can display text, and it guesses and learns in context, as you could tell from Shrestha's demo, about when it should clear the screen, because there's an optional clear argument to the add-text-to-the-screen function.
And it's super impressive, but also very jagged-frontier, in terms of whether it can intuit what you want in those contexts. So thank you for doing this, because this is what I do all the time with this code: trying to figure out what these models can do, what kind of code you have to write, and what you don't have to scaffold for them to do well.
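For readers following along, here's a hedged sketch of the kind of tool declaration and handler being described: a single display function with an optional clear flag, leaving the when-to-clear decision entirely to the model. The function name, fields, and display_state are illustrative stand-ins, not the actual demo code.

```python
# Hypothetical sketch of the demo's "display text" tool: a JSON-schema-style
# function declaration with an optional `clear` flag, plus the app-side
# handler. Names and fields are illustrative, not the actual demo code.
ADD_TEXT_TO_SCREEN = {
    "name": "add_text_to_screen",
    "description": "Show text to the user. Optionally clear the screen first.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "Text to display."},
            "clear": {
                "type": "boolean",
                "description": "If true, clear existing text before displaying.",
            },
        },
        "required": ["text"],
    },
}

display_state: list[str] = []


def handle_add_text_to_screen(args: dict) -> dict:
    """App-side handler: the model decides *when* to clear; we just obey."""
    if args.get("clear", False):
        display_state.clear()
    display_state.append(args["text"])
    return {"status": "ok", "lines_on_screen": len(display_state)}
```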
Yeah. And playing with this, every turn is different. It's interesting to see the things it struggles with, like your name: even if I spell out the exact letters, it somehow really wants to spell Quinn its own way. I think also, with turn detection, as we saw, there's of course a lot of work that can still be done there.
And there's, of course, a lot of variation. There are times when it gets the grocery list perfect and combines the lists perfectly, and sometimes it's a bit in the middle, like here. The way this code works is that, for a given session, it loads lots and lots of previous conversational sessions as user/assistant messages. Sometimes, depending on the version of the code I've got running, it summarizes a little bit; sometimes it doesn't.
So we really are leaning on the intelligence of the LLM to do all of the contextual understanding about what we mean by a list and the context in which we're talking about that list. It's super amazing that it works at all, basically, in my mind.
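As a rough illustration of that context strategy, here's a sketch that loads prior sessions as user/assistant messages and optionally collapses the oldest ones into a summary. The on-disk format, the keep_verbatim cutoff, and the summarize() stub are assumptions for illustration, not the demo's actual code.

```python
# Sketch of the context strategy described above: load previous sessions as
# user/assistant messages and optionally collapse the oldest ones into a
# short summary. Storage format and summarize() are assumptions.
import json
from pathlib import Path


def summarize(messages: list[dict]) -> str:
    # Placeholder: in a real app this could be another LLM call.
    return f"(summary of {len(messages)} earlier messages about the user's lists)"


def build_context(session_dir: str, keep_verbatim: int = 3) -> list[dict]:
    """Return an LLM message list built from stored session transcripts."""
    sessions = sorted(Path(session_dir).glob("*.json"))
    older, recent = sessions[:-keep_verbatim], sessions[-keep_verbatim:]

    messages: list[dict] = []
    if older:
        old_msgs = [m for path in older for m in json.loads(path.read_text())]
        messages.append({"role": "user", "content": summarize(old_msgs)})
    for path in recent:
        # Each file holds [{"role": "user"|"assistant", "content": "..."}]
        messages.extend(json.loads(path.read_text()))
    return messages
```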
And it's all voice driven, and it's all multimodal from the ground up. We have a whole other video we could show, but I definitely think we're out of time, so we'll skip it. You have the final talk, and everyone seems to be excited. So maybe we should talk about our grandmothers?
Oh, yes, I totally forgot that part. Sorry, let's skip past the demo where it gets the grocery list perfect. I think maybe this crowd would like to see that demo. No, that was great. So this has been fun for me to work on because like it's so relevant to my everyday life.
But when Shrestha and I were talking about it, I think there was actually something else she said that really hooked me. Yeah, so you know, my grandmother was Indian, of course, and she used to wear this cloth garment called a sari. And her way of reminding herself when she had to do things was tying knots in the sari.
And then I was chatting with Quinn, and what was incredible is that apparently his grandmother in North Carolina, so very different from Calcutta in India, used to tie strings around her fingers. Firstly, you know, this is kind of incredible: no matter how many continents separate us, smart people come up with the same generally intelligent patterns.
But it's also incredible how technology allows humans to evolve. Now, the one problem with either the knots or the strings is you knew you had to remember something, but you didn't know what it was. So you still relied on your memory. And, you know, ultimately, that's why I do the work I do at Google, because I want to build the technologies that enable, you know, an infinite world of creative possibilities tomorrow, or even today, across continents.
And I just want to say that we believe that voice is the most natural of interfaces. And there will come a world where most of the interaction with language models will happen via voice. And the Gemini models are trained to be multimodal from the ground up. So, of course, they ingest text, voice, but also images and video.
So if you have any questions about Gemini, please reach out to me on X, on LinkedIn, email, wherever. Happy to work with builders like yourself. Yeah, thanks for coming to the talk. And we would love to see what you build with these models and APIs.