
Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)


Chapters

0:00 [Voice Keynote] Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
1:29 Introduction to Voice AI and Latency
2:46 Latency Breakdown in a Voice AI Application
3:27 WebRTC vs. WebSockets for Real-Time Audio
6:41 Advantages of WebRTC
7:49 Applications of WebRTC
8:52 Future of Voice AI and User Interfaces
9:59 Squabbert Demo
12:44 Flexibility of WebRTC Connections
13:09 Community Showcase: Yashin's Project
15:46 Call to Action and Resources

Whisper Transcript

00:00:00.000 | all right Squabbert you ready to get packed up I don't know I'm pretty nervous oh relax you got
00:00:21.000 | nothing to worry about but this is like the worst idea ever for a live demo a live unscripted
00:00:27.320 | conversation with a non deterministic LLM on conference Wi-Fi come on your prompt is great
00:00:33.620 | and I mean it's gonna be on stage audio in a room full of echoes sometimes my text-to-speech even
00:00:39.080 | mispronounces my own name why did you even name me Squabbert what if I say squiggly or something again
00:00:45.680 | okay take a breath well you don't really do that let's just take it one step at a time just start
00:00:52.280 | with the intro I guess you're right okay here I go hi everybody I'm Squabbert here to take us on a
00:01:00.380 | whirlwind tour of the wonderful world of WebRTC please welcome Sean and Kwin hey I'm Sean I work
00:01:09.440 | on WebRTC at OpenAI some of the things you might be familiar with are the Realtime API or 1-800-ChatGPT
00:01:15.200 | you can call it right from your phone before I worked at OpenAI I worked on the Go implementation
00:01:20.840 | of WebRTC called Pion and I'm Kwin I work at Daily on real-time audio and video infrastructure and on
00:01:27.020 | an open source voice agent framework called Pipecat today we're going to talk about how to build
00:01:31.460 | natural fast human-like voice experiences we're going to give you a crash course on low latency
00:01:38.300 | audio and video and I hope we'll show you a couple of things you might not have thought of around voice AI
00:01:43.220 | before if you want to build a conversational voice experience that people really love you're really
00:01:51.740 | going to stress a lot about latency nothing else matters if your AI responds too slowly building voice
00:01:57.320 | AI experiences is similar to other kinds of AI engineering in most ways if you've built multi-turn agents a lot
00:02:03.440 | of that will port over to building voice agents but the big difference is latency everything in a
00:02:12.120 | voice AI app needs to be built from the ground up for fast response times if you're talking to a
00:02:17.060 | person around 500 milliseconds sounds natural when talking to an AI system people bring those same
00:02:23.300 | expectations response latencies much above a second in general doom your voice agent to very low
00:02:29.120 | completion rates and low NPS scores and hang-ups and we're talking here about voice-to-voice latency so
00:02:34.760 | this is the time between when I the human stop talking and the time I hear the first audio byte come
00:02:42.120 | back from the LLM let's take a look at how latency adds up in a typical voice-to-voice AI application so
00:02:50.920 | this is a breakdown from a real voice AI running in a web browser on macOS talking over the internet to
00:02:57.760 | a voice agent running in the cloud running on Pipecat a couple of things to note our voice-to-voice latency is
00:03:05.080 | just under a second that's good but not great we can make things a little faster but that comes with
00:03:10.600 | trade-offs lower quality or cost second it's frustratingly easy to do even worse than this your LLM
00:03:17.440 | might be slower other things get in the way or worst of all Bluetooth don't get me started on Bluetooth
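To make the idea of a voice-to-voice latency budget concrete, here is a minimal sketch in Python. The stage names and millisecond values are illustrative assumptions chosen to add up to just under a second; they are not the figures from the breakdown shown in the talk.

```python
# Illustrative voice-to-voice latency budget.
# Every stage name and number below is an assumption for the sake of the
# example, not a measurement from the talk.
STAGES_MS = {
    "mic capture and OS audio buffers": 40,
    "audio encode": 20,
    "network uplink": 30,
    "jitter buffer and decode": 40,
    "speech-to-text and turn detection": 300,
    "LLM time to first token": 350,
    "text-to-speech time to first byte": 120,
    "network downlink": 30,
    "playout buffering": 40,
}


def voice_to_voice_ms(stages):
    """Total time from end of user speech to first audio heard back."""
    return sum(stages.values())


def over_budget(stages, budget_ms=1000):
    """True if the pipeline is slower than the one-second target."""
    return voice_to_voice_ms(stages) > budget_ms


total = voice_to_voice_ms(STAGES_MS)
print(f"voice-to-voice: {total} ms, over budget: {over_budget(STAGES_MS)}")

# One extra slow hop (a hypothetical 150 ms wireless headset link) is enough
# to push the same pipeline past a second.
with_headset = dict(STAGES_MS, **{"wireless headset link": 150})
print(f"with headset: {voice_to_voice_ms(with_headset)} ms, "
      f"over budget: {over_budget(with_headset)}")
```

The point of writing the budget down this way is that every stage is additive, so a single slow component anywhere in the chain is felt directly by the user.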
00:03:23.200 | but the single biggest mistake we see people making is using the wrong approach to sending and receiving
00:03:30.520 | audio over the network it's time to talk about WebRTC and WebSockets if you're new to building voice
00:03:37.540 | applications you probably think hey I need like a long-lived connection I'm gonna send audio and video
00:03:43.540 | over this long-lived connection I've used WebSockets for long-lived data connections before I'm just
00:03:48.520 | going to write some WebSockets code that's great if you're sending small amounts of
00:03:54.820 | data over a long-lived connection it doesn't work for real-time audio in fact WebSockets are almost the opposite of what you want
00:03:59.860 | from a network engineering perspective for real-time audio and video so let's do a compare and contrast on
00:04:06.460 | this WebSockets are great if you're trying to deliver audio and you want something really easy that can
00:04:14.560 | target all platforms if you're trying to build a prototype for lots of different platforms WebSockets are the
00:04:18.880 | way to go on the other hand WebRTC solves a bunch of things around giving you high quality audio
00:04:25.360 | high bandwidth low latency but the catch is it can be a lot more complicated to implement and it's
00:04:33.460 | specialized so it can be frustrating lots of applications use both but for different things so
00:04:40.060 | here's the TL;DR if you only remember one thing from this talk use WebSockets for those server to server use cases
00:04:46.340 | and small amounts of structured data and places where you want a prototype use WebRTC if you're sending
00:04:51.320 | audio and video streams over the internet from your web app or your native app that's where it excels so why is
00:04:58.280 | it so important to use WebRTC for real-time edge to cloud audio a WebSocket is a TCP connection TCP
00:05:05.120 | guarantees in-order delivery of network packets if you send some data that data is going to arrive exactly as
00:05:11.180 | you send it or it's not going to arrive at all you put packets in your operating system's send queue
00:05:16.100 | that OS queue is going to keep trying to send them until they either get ACKed by the other side or your
00:05:21.400 | connection completely times out and this is in general what you want if you're doing most network
00:05:26.720 | programming if you're making a web request for example this is perfect it's not what you want if
00:05:31.420 | you're aiming for conversational latency remember that we're trying to hit a voice-to-voice latency of
00:05:37.640 | under one second and ideally even better what we want to ignore is things like the occasional packet loss so
00:05:44.540 | imagine if a packet is dropped I don't really care about what happened a second ago so WebRTC does
00:05:49.340 | clever math and buffer management that we're going to talk about more to hide where that happens so
00:05:53.960 | this is the first and most important thing WebRTC does for you it's all that machinery that sends
00:05:58.340 | packets as fast as possible and ignores packets that don't arrive inside that very tight latency budget
00:06:03.560 | we're operating within you can think of it as like super fast best effort networking if this were all WebRTC
00:06:10.880 | could do for you compared to WebSockets it would still be worth using WebRTC just for this because
00:06:16.160 | you literally can't implement this on top of a TCP stack or on top of WebSockets again the operating
00:06:21.440 | system is just going to try to keep sending whatever you tell it to send it's going to block everything if
00:06:25.520 | you have any packet loss or significant sort of jitter or delay in the network and we have
00:06:30.920 | lots and lots of real world data on this in the real world this means you will get audio glitchiness or
00:06:34.740 | high latency or unexpected socket disconnections in 10 to 15 percent of your network connections
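The difference between in-order delivery and best-effort delivery with a playout deadline can be sketched with a toy simulation in Python. This is a deliberately simplified model under stated assumptions: one audio packet every 20 ms, a fixed 50 ms one-way delay, a single lost packet that a TCP-like transport retransmits after a 200 ms timeout, and a receiver that plays audio 100 ms after it was sent. Real TCP stacks and real WebRTC jitter buffers are far more sophisticated (and WebRTC can also retransmit selectively); all names and numbers here are hypothetical.

```python
def arrival_times(n, interval_ms, delay_ms, lost, rto_ms):
    """Arrival time of each packet; a lost packet shows up one
    retransmission timeout later than it otherwise would."""
    return [
        i * interval_ms + delay_ms + (rto_ms if i in lost else 0)
        for i in range(n)
    ]


def in_order_latency(arrivals, interval_ms):
    """TCP-like delivery: a packet is released to the application only
    after every earlier packet has been released (head-of-line blocking).
    Returns per-packet latency measured from its send time."""
    latencies, released = [], 0
    for i, arrived in enumerate(arrivals):
        released = max(released, arrived)
        latencies.append(released - i * interval_ms)
    return latencies


def best_effort_played(arrivals, interval_ms, playout_delay_ms):
    """Best-effort delivery: a packet is played only if it arrives before
    its playout deadline; otherwise it is skipped and concealed."""
    return [
        arrived <= i * interval_ms + playout_delay_ms
        for i, arrived in enumerate(arrivals)
    ]


arrivals = arrival_times(n=10, interval_ms=20, delay_ms=50, lost={3}, rto_ms=200)
tcp_like = in_order_latency(arrivals, interval_ms=20)
played = best_effort_played(arrivals, interval_ms=20, playout_delay_ms=100)

print("in-order latency per packet (ms):", tcp_like)
print("packets late by more than 100 ms:", sum(1 for ms in tcp_like if ms > 100))
print("best-effort packets skipped:", played.count(False))
```

In this model one lost packet delays every packet queued behind it on the in-order path, while the best-effort path gives up a single 20 ms frame and keeps every other frame at a constant latency.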
00:06:40.460 | but WebRTC does a lot more than that if you go and you try to build the same application in WebSockets
00:06:48.860 | you have to handle resampling you have to handle packetization and doing all that bandwidth estimation
00:06:54.380 | networks are constantly changing and fluctuating so you can't just send one bit rate and you also
00:06:59.840 | get standard APIs for getting the stats and observability this is all just built into WebRTC but if
00:07:05.660 | you decide to do WebSockets you have to build it yourself so if you look at this code up on the
00:07:09.980 | screen on the right side is an example of WebRTC sending one bi-directional stream of audio on the
00:07:15.980 | left side is WebSockets and you want to spend more time building your application and less time
00:07:20.660 | worrying about things like sample rates that's why you pick WebRTC and this is real code using your
00:07:26.060 | OpenAI Realtime API where you offer both options for developers so I hope we've convinced you
00:07:32.300 | that you should use WebRTC if you're doing edge to cloud audio and especially audio and video we love talking
00:07:39.500 | about this stuff if you come find us later we will talk your ear off about jitter buffers and packet
00:07:43.580 | management and bandwidth shaping and all that stuff but what we want to do now is move on and talk about
00:07:48.140 | a whole nother category of fun stuff which is what you can actually do with WebRTC I'll start by saying
00:07:52.940 | you can embed real-time audio in any app you write any website any iOS app any Android app lots of fun
00:07:59.180 | embedded stuff and the network connections will just work you will get good audio on any device any
00:08:05.720 | platform almost any real-world network connection and I bet you used WebRTC today already if
00:08:11.480 | you used Facebook Messenger WhatsApp Zoom Discord you know any of these applications they're using WebRTC
00:08:17.240 | but you didn't know that there's even more cool things happening with WebRTC I worked with a company
00:08:23.000 | that was doing surgery over the internet people teleoperate vehicles in the field it's super cool
00:08:30.520 | WebRTC is kind of the standard language of the real-time world and that's why it makes it so easy
00:08:35.800 | that we can go build conversational intelligence on top of it all the stuff's already been solved I mean
00:08:40.920 | in the new LLM era we know lots of people who spend hours talking to computers driving their development
00:08:46.520 | environments with voice doing brainstorming treating the computer as a personal assistant a coach a therapist
00:08:52.120 | a researcher I'm convinced voice is going to be the core building block of the next generation of UIs of
00:08:58.600 | the UIs for the generative AI era we have to do a little iteration before we figure out really what
00:09:04.200 | those UIs look and sound like but that future seems very clear to me one of the things I say to people
00:09:10.360 | to kind of try to communicate how excited I am about building all this stuff is we all lived through the
00:09:16.440 | last platform shift it was recent enough that the move of most computing from desktop to mobile happened
00:09:23.080 | within our memories and so by analogy we are in late 2007 now we have the first iPhones but we haven't yet
00:09:32.120 | invented pull to refresh what keeps me so excited and motivated is I feel like voice is like the next
00:09:39.480 | bicycle for the mind you know today we've only been able to use computers with our eyes and our hands
00:09:44.200 | but now I can talk to it as well think of all those situations where you have your voice
00:09:50.120 | available you don't have your hands and the other great thing is you can keep all this computing power
00:09:55.640 | remotely and access it and have small devices near you so on the small devices front we brought a friend
00:10:02.680 | with us today uh he's a little shy sometimes so give him just a second
00:10:07.160 | so uh Squabbert can you tell people a little bit about your tech stack
00:10:23.000 | i'm a friendly stuffed animal running a web interface locally on a Raspberry Pi i'm connected directly to a Python
00:10:45.640 | process running on a laptop using MLX Whisper Gemma 3 and a mildly buggy custom logit sampler that Kwin wrote
00:10:56.360 | i mean come on now like i had a good reason that i wanted to write a logit sampler and i i i wanted you to be able
00:11:03.560 | to do something that even the really amazing big llms running in the cloud can't do do you know what i'm
00:11:09.800 | talking about
00:11:10.280 | you're talking about my syllable counting talent it's pretty neat even if i do say so myself
00:11:18.760 | i mean counting syllables is like the 2025 version of how many r's are in strawberry and you know
00:11:26.040 | Squabbert why don't we demonstrate that a little bit can you make up a poem like a four-line poem
00:11:32.520 | about computer programming and use only two syllable words
00:11:36.200 | two okay listen logic coding people knowing system working future growing
00:11:47.480 | one more maybe uh that was good but i've heard you do better
00:11:57.320 | okay here's another one pixels dancing colors bright program running feels so right coding faster day and
00:12:03.720 | night making magic pure delight
00:12:05.560 | you forgot about the two syllable words thing but we'll let it go yeah definitely the
00:12:10.520 | first one had two syllable words i think we should say goodbye just in the interest of time
00:12:14.200 | Squabbert but thank you so much
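The talk does not describe how the custom logit sampler works, so the following Python sketch only illustrates the general technique of constrained sampling: before picking the next token, mask out every candidate that breaks a rule. It is a word-level toy under stated assumptions. The vocabulary, the hand-assigned syllable counts, and the scores are all hypothetical, and a real sampler would operate on a model's sub-word token logits rather than whole words.

```python
import math

# Hypothetical toy vocabulary with hand-assigned syllable counts.
SYLLABLES = {
    "listen": 2, "logic": 2, "coding": 2, "program": 2, "magic": 2,
    "bright": 1, "night": 1, "computer": 3,
}


def mask_logits(logits, allowed):
    """Return a copy of the logits where every disallowed token is set to
    negative infinity, so it can never be selected."""
    return {
        token: (score if allowed(token) else -math.inf)
        for token, score in logits.items()
    }


def greedy_pick(logits):
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)


def is_two_syllable(token):
    """Constraint used in the demo: only two-syllable words are allowed."""
    return SYLLABLES.get(token) == 2


# Made-up scores standing in for a model's next-token logits.
logits = {"bright": 3.1, "computer": 2.7, "magic": 2.2, "night": 1.9, "logic": 0.4}

unconstrained = greedy_pick(logits)
constrained = greedy_pick(mask_logits(logits, is_two_syllable))
print("unconstrained:", unconstrained)
print("constrained:", constrained)
```

Without the mask the top-scoring word wins even though it breaks the rule; with the mask the best word that satisfies the constraint is chosen instead.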
00:12:23.320 | i think Squabbert did really really well um so that is a Raspberry Pi connected with a peer-to-peer
00:12:30.680 | WebRTC connection directly to my laptop over the same local area network so that was a serverless
00:12:37.560 | WebRTC connection Squabbert is talking directly to the laptop but what's super cool about WebRTC is you
00:12:42.600 | have all these different choices for how you want to connect to things so you could do this local
00:12:46.680 | connection or something like Squabbert could connect to a server running in the cloud
00:12:53.160 | and do all the ai stuff and then the third option is you can go and connect up to something like Pipecat
00:12:57.720 | and make it multi-party so bring llms into meetings or other places like that um super awesome how like
00:13:04.440 | you get all this flexibility to build these things the way that like matches your application
00:13:08.200 | so we want to close with a video from a builder who's actually in the audience
00:13:13.080 | um super excited and um the best part about this is that um when she built this she had
00:13:19.160 | never written any code before this and so what makes me so excited about the future of voice
00:13:23.640 | is we can make this easy enough that people that have really innovative inspirational ideas can go and do
00:13:28.280 | stuff themselves hi i'm yashin i'm a mom of two bilingual kids i'm raising my kids bilingual because
00:13:39.480 | i want them to connect with my cultural roots and that's always shared by many parents of bilingual
00:13:44.920 | children but raising bilingual kids is hard it's expensive it's time consuming and it often feels like
00:13:52.600 | a big chore on both the kids and parents i believe we can do a lot better with today's technology i really
00:14:00.600 | believe we can make language education feel more natural and even fun for the kids here's a quick clip of
00:14:07.720 | what i've been working on hi there buddy ready to have some fun today
00:14:26.440 | let's say hello in mandarin we say nǐ hǎo can you say nǐ hǎo nǐ hǎo
00:14:34.920 | okay it's still early but i'm excited about what's possible i'm not technical but with just a little
00:14:43.160 | bit of guidance from a kind member of this community i was able to bring this first version to life i already
00:14:51.560 | have a group of eager testers mostly parents like me who are excited to try this if this also sounds
00:14:59.560 | exciting to you i would love to connect thanks so much for watching and let's build this future together
00:15:12.840 | so we put the qr code up here for Yashin's project which i absolutely love she's here uh if you're
00:15:18.520 | here in the audience and uh you're interested in multilingual stuff or building
00:15:23.400 | these kind of things please find her it's such a great thing um if you're watching on
00:15:28.920 | YouTube here's the qr code Sean and i are super excited about these kind of projects i mean
00:15:34.920 | the person Yashin shouted out in the video is of course Sean who has done more than anybody i know
00:15:40.040 | to make WebRTC accessible to everyone the idea that we want to leave you with is if you have an idea
00:15:46.440 | in voice ai and WebRTC whether you've been a programmer for years or you're just getting started
00:15:51.720 | like we are here to support you we're so excited and we believe that things like Pipecat and live
00:15:57.160 | peer like if we can make this easier we're going to see the next generation of really exciting innovative
00:16:01.880 | projects come find us in the hallway or online we hang out on discord and twitter and linkedin
00:16:06.840 | and uh here are the resources that you can scan and hopefully you find helpful so Kwin wrote an
00:16:12.600 | amazing book that i believe is in your bag um and then yeah so thank you so much we can't wait to see
00:16:19.000 | what you build