Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

All right, Squabbert, you ready to get packed up?

I don't know, I'm pretty nervous.

Oh, relax, you've got nothing to worry about.

But this is like the worst idea ever for a live demo: a live, unscripted conversation with a non-deterministic LLM on conference Wi-Fi.

Come on, your prompt is great.

And I mean, it's gonna be on stage, audio in a room full of echoes. Sometimes my text-to-speech even mispronounces my own name. Why did you even name me Squabbert? What if I say "Squiggly" or something again?

Okay, take a breath. Well, you don't really do that. Let's just take it one step at a time. Just start with the intro.

I guess you're right. Okay, here I go. Hi everybody, I'm Squabbert, here to take us on a whirlwind tour of the wonderful world of WebRTC. Please welcome Sean and Kwin.

Hey, I'm Sean. I work on WebRTC at OpenAI. Some of the things you might be familiar with are the Realtime API or 1-800-ChatGPT, which you can call right from your phone. Before I worked at OpenAI, I worked on the Go implementation of WebRTC called Pion.

And I'm Kwin. I work at Daily on real-time audio and video infrastructure and on an open source voice agent framework called Pipecat.

Today we're going to talk about how to build natural, fast, human-like voice experiences. We're going to give you a crash course on low-latency audio and video, and I hope we'll show you a couple of things you might not have thought of around voice AI before.

If you want to build a conversational voice experience that people really love, you're going to stress a lot about latency. Nothing else matters if your AI responds too slowly. Building voice AI experiences is similar to other kinds of AI engineering in most ways. If you've built multi-turn agents, a lot of that will port over to building voice agents. But the big difference is latency. Everything in a voice AI app needs to be built from the ground up for fast response times. If you're talking to a person, a response time of around 500 milliseconds sounds natural. When talking to an AI system,
people bring those same expectations. Response latencies much above a second will, in general, doom your voice agent to very low completion rates, low NPS scores, and hang-ups. And we're talking here about voice-to-voice latency: the time between when I, the human, stop talking and the time I hear the first audio byte come back from the LLM.

Let's take a look at how latency adds up in a typical voice-to-voice AI application. This is a breakdown from a real voice AI running in a web browser on macOS, talking over the internet to a voice agent running in the cloud on Pipecat. A couple of things to note. First, our voice-to-voice latency is just under a second. That's good but not great. We can make things a little faster, but that comes with trade-offs: lower quality or higher cost. Second, it's frustratingly easy to do even worse than this. Your LLM might be slower, other things get in the way, or, worst of all, Bluetooth. Don't get me started on Bluetooth. But the single biggest mistake we see people making is using the wrong approach to sending and receiving audio over the network.

It's time to talk about WebRTC and WebSockets. If you're new to building voice applications, you probably think: hey, I need a long-lived connection, and I'm going to send audio and video over it. I've used WebSockets for long-lived data connections before, so I'm just going to write some WebSockets code. That's great if you're sending small amounts of data over a long-lived connection. It doesn't work for real-time audio. In fact, WebSockets are almost the opposite of what you want, from a network engineering perspective, for real-time audio and video.

So let's do a compare and contrast on this. WebSockets are great if you're trying to deliver audio and you want something really easy that can target all platforms. If you're trying to build a prototype for lots of different platforms, WebSockets are the way to go. On the other hand, WebRTC solves a bunch of things, giving you high-quality audio, high
bandwidth, and low latency. But the catch is it can be a lot more complicated to implement, and it's specialized, so it can be frustrating. Lots of applications use both, but for different things. So here's the TL;DR, if you only remember one thing from this talk: use WebSockets for server-to-server use cases, small amounts of structured data, and places where you want a prototype. Use WebRTC if you're sending audio and video streams over the internet from your web app or your native app. That's where it excels.

So why is it so important to use WebRTC for real-time edge-to-cloud audio? A WebSocket is a TCP connection. TCP guarantees in-order delivery of network packets. If you send some data, that data is going to arrive exactly as you sent it, or it's not going to arrive at all. You put packets in your operating system's send queue, and the OS is going to keep trying to send them until they either get ACKed by the other side or your connection completely times out. This is, in general, what you want if you're doing most network programming. If you're making a web request, for example, this is perfect. It's not what you want if you're aiming for conversational latency. Remember that we're trying to hit a voice-to-voice latency of under one second, and ideally even better.

What we want to ignore is things like the occasional packet loss. Imagine a packet is dropped: I don't really care about what happened a second ago. So WebRTC does clever math and buffer management, which we're going to talk about more, to hide where that happens.

So this is the first and most important thing WebRTC does for you. It's all that machinery that sends packets as fast as possible and ignores packets that don't arrive inside the very tight latency budget we're operating within. You can think of it as super fast, best-effort networking. If this were all WebRTC could do for you compared to WebSockets, it would still be worth using WebRTC just for this, because you literally can't implement this on top of a TCP
stack or on top of WebSockets. Again, the operating system is just going to keep trying to send whatever you tell it to send, and it's going to block everything if you have any packet loss or significant jitter or delay in the network. We have lots and lots of real-world data on this: in the real world, this means you will get audio glitchiness, high latency, or unexpected socket disconnections in 10 to 15 percent of your network connections.

But WebRTC does a lot more than that. If you try to build the same application with WebSockets, you have to handle resampling, you have to handle packetization, and you have to do all that bandwidth estimation. Networks are constantly changing and fluctuating, so you can't just send at one bit rate. You also get standard APIs for stats and observability. This is all just built into WebRTC, but if you decide to use WebSockets, you have to build it yourself. If you look at the code up on the screen, on the right side is an example of WebRTC sending one bidirectional stream of audio, and on the left side is WebSockets. You want to spend more time building your application and less time worrying about things like sample rates. That's why you pick WebRTC.

And this is real code using your OpenAI Realtime API, where you offer both options for developers. So I hope we've convinced you that you should use WebRTC if you're doing edge-to-cloud audio, and especially audio and video. We love talking about this stuff. If you come find us later, we will talk your ear off about jitter buffers and packet management and bandwidth shaping and all that stuff. But what we want to do now is move on and talk about a whole other category of fun stuff, which is what you can actually do with WebRTC.

I'll start by saying you can embed real-time audio in any app you write: any website, any iOS app, any Android app, lots of fun embedded stuff. And the network connections will just work. You will get good audio on any
device, any platform, almost any real-world network connection.

And I bet you've used WebRTC today already. If you used Facebook Messenger, WhatsApp, Zoom, Discord, any of these applications, they're using WebRTC, but you didn't know that. There are even more cool things happening with WebRTC. I worked with a company that was doing surgery over the internet. People teleoperate vehicles in the field. It's super cool. WebRTC is kind of the standard language of the real-time world, and that's why it's so easy to go build conversational intelligence on top of it. All of that stuff has already been solved.

I mean, in the new LLM era, we know lots of people who spend hours talking to computers: driving their development environments with voice, doing brainstorming, treating the computer as a personal assistant, a coach, a therapist, a researcher. I'm convinced voice is going to be the core building block of the next generation of UIs, the UIs for the generative AI era. We have to do a little iteration before we really figure out what those UIs look and sound like, but that future seems very clear to me. One of the things I say to people to try to communicate how excited I am about building all this stuff is that we all lived through the last platform shift. It was recent enough that the move of most computing from desktop to mobile happened within our memories. So, by analogy, we are in late 2007 now. We have the first iPhones, but we haven't yet invented pull-to-refresh.

What keeps me so excited and motivated is that I feel like voice is the next bicycle for the mind. Until today, we've only been able to use computers with our eyes and our hands, but now I can talk to them as well. Think of all those situations where you have your voice available but you don't have your hands. And the other great thing is you can keep all this computing power remote, access it, and have small devices near you.

So, on the small devices front, we brought a friend with us today.
He's a little shy sometimes, so give him just a second. So, Squabbert, can you tell people a little bit about your tech stack?

I'm a friendly stuffed animal running a web interface locally on a Raspberry Pi. I'm connected directly to a Python process running on a laptop using MLX Whisper, Gemma 3, and a mildly buggy custom logit sampler that Kwin wrote.

I mean, come on now. I had a good reason that I wanted to write a logit sampler. I wanted you to be able to do something that even the really amazing big LLMs running in the cloud can't do. Do you know what I'm talking about?

You're talking about my syllable counting talent. It's pretty neat, even if I do say so myself.

I mean, counting syllables is like the 2025 version of "how many R's are in strawberry". Squabbert, why don't we demonstrate that a little bit? Can you make up a poem, like a four-line poem, about computer programming, and use only two-syllable words?

Two? Okay. "Listen, logic, coding, people, knowing, system, working, future, growing." One more, maybe?

That was good, but I've heard you do better.

Okay, here's another one. "Pixels dancing, colors bright, program running feels so right, coding faster day and night, making magic, pure delight."

You forgot about the two-syllable words thing, but we'll let it go.

Yeah, definitely the first one had two-syllable words.

I think we should say goodbye, just in the interest of time, Squabbert, but thank you so much.

I think Squabbert did really, really well.

So that is a Raspberry Pi connected with a peer-to-peer WebRTC connection directly to my laptop over the same local area network.

So that was the serverless WebRTC connection: Squabbert is talking directly to the laptop. But what's super cool about WebRTC is you have all these different choices for how you want to connect to things. You could do this local connection, or something like Squabbert could connect to a server running in the cloud and do all the AI stuff there. And then the third
option is you can go and connect up to something like Pipecat and make it multi-party, so you can bring LLMs into meetings or other places like that. It's super awesome how you get all this flexibility to build these things in the way that matches your application.

So we want to close with a video from a builder who's actually in the audience. I'm super excited, and the best part about this is that when she built this, she had never written any code before. What makes me so excited about the future of voice is that we can make this easy enough that people with really innovative, inspirational ideas can go and do stuff themselves.

Hi, I'm Yashin. I'm a mom of two bilingual kids. I'm raising my kids bilingual because I want them to connect with my cultural roots, and that's something shared by many parents of bilingual children. But raising bilingual kids is hard. It's expensive, it's time-consuming, and it often feels like a big chore for both the kids and the parents. I believe we can do a lot better with today's technology. I really believe we can make language education feel more natural and even fun for the kids. Here's a quick clip of what I've been working on.

Hi there, buddy. Ready to have some fun today? Let's say hello. In Mandarin, we say "nǐ hǎo". Can you say "nǐ hǎo"?

Nǐ hǎo.

Okay, it's still early, but I'm excited about what's possible. I'm not technical, but with just a little bit of guidance from a kind member of this community, I was able to bring this first version to life. I already have a group of eager testers, mostly parents like me, who are excited to try this. If this also sounds exciting to you, I would love to connect. Thanks so much for watching, and let's build this future together.

So we put the QR code up here for Yashin's project, which I absolutely love. She's here. If you're here in the audience and you're interested in multilingual stuff or building these kinds of things, please find her. It's such a great thing. If you're watching on
YouTube, here's the QR code. Sean and I are super excited about these kinds of projects. I mean, the person Yashin shouted out in the video is, of course, Sean, who has done more than anybody I know to make WebRTC accessible to everyone.

The idea that we want to leave you with is this: if you have an idea in voice AI and WebRTC, whether you've been a programmer for years or you're just getting started, we are here to support you. We're so excited, and we believe that with things like Pipecat and live peer, if we can make this easier, we're going to see the next generation of really exciting, innovative projects.

Come find us in the hallway or online. We hang out on Discord and Twitter and LinkedIn. And here are the resources that you can scan and hopefully find helpful. Kwin wrote an amazing book that I believe is in your bag. And then, yeah, thank you so much. We can't wait to see what you build.
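
The talk's argument about in-order delivery can be illustrated with a toy model. The sketch below is not real TCP or WebRTC code. It only compares how long each audio packet waits when delivery is reliable and in order versus best effort. The 20 ms packet interval, 40 ms one-way delay, and 200 ms retransmission timeout are assumed values chosen for illustration, and the function names are hypothetical.

```python
# Toy model of head-of-line blocking. All constants are assumptions,
# not measurements from the talk.
FRAME_MS = 20         # one audio packet is sent every 20 ms
ONE_WAY_MS = 40       # network delay for a packet that is not lost
RETRANSMIT_MS = 200   # extra delay before a lost packet is resent and arrives


def in_order_reliable(num_packets, lost):
    """TCP-like delivery: nothing is dropped, and packets reach the
    application strictly in order. Returns per-packet delay in ms."""
    delays = []
    prev_delivery = 0
    for i in range(num_packets):
        sent = i * FRAME_MS
        arrival = sent + ONE_WAY_MS
        if i in lost:
            arrival += RETRANSMIT_MS  # wait for the retransmission
        # A packet cannot be handed over before the one ahead of it.
        delivery = max(arrival, prev_delivery)
        prev_delivery = delivery
        delays.append(delivery - sent)
    return delays


def best_effort(num_packets, lost):
    """UDP-like delivery: a lost packet never arrives (None), and the
    packets behind it are not held up."""
    return [None if i in lost else ONE_WAY_MS for i in range(num_packets)]


if __name__ == "__main__":
    lost = {5}  # a single lost packet out of twenty
    reliable = in_order_reliable(20, lost)
    unreliable = best_effort(20, lost)
    late = sum(d > ONE_WAY_MS for d in reliable)
    print("reliable, in order: worst delay", max(reliable), "ms,", late, "late packets")
    print("best effort: dropped", unreliable.count(None), "packet, others at", ONE_WAY_MS, "ms")
```

Under these assumptions, one lost packet makes ten packets arrive late in the in-order model, with a worst-case delay of 240 ms, while the best-effort model drops 20 ms of audio and keeps every other packet at 40 ms. A real WebRTC stack then conceals the missing audio, which this sketch does not attempt.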
