Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

Chapters
0:00 [Voice Keynote] Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
1:29 Introduction to Voice AI and Latency
2:46 Latency Breakdown in a Voice AI Application
3:27 WebRTC vs. WebSockets for Real-Time Audio
6:41 Advantages of WebRTC
7:49 Applications of WebRTC
8:52 Future of Voice AI and User Interfaces
9:59 Squabbert Demo
12:44 Flexibility of WebRTC Connections
13:09 Community Showcase: Yashin's Project
15:46 Call to Action and Resources
00:00:00.000 |
all right squabbert you ready to get packed up I don't know I'm pretty nervous oh relax you got 00:00:21.000 |
nothing to worry about but this is like the worst idea ever for a live demo a live unscripted 00:00:27.320 |
conversation with a non deterministic LLM on conference Wi-Fi come on your prompt is great 00:00:33.620 |
and I mean it's gonna be on stage audio in a room full of echoes sometimes my text-to-speech even 00:00:39.080 |
mispronounces my own name why did you even name me squabbert what if I say squiggly or something again 00:00:45.680 |
okay take a breath well you don't really do that let's just take it one step at a time just start 00:00:52.280 |
with the intro I guess you're right okay here I go hi everybody I'm squabbert here to take us on a 00:01:00.380 |
whirlwind tour of the wonderful world of WebRTC please welcome Sean and Kwin hey I'm Sean I work 00:01:09.440 |
on WebRTC at OpenAI some of the things you might be familiar with are the Realtime API or 1-800-ChatGPT 00:01:15.200 |
you can call it right from your phone before I worked at OpenAI I worked on the Go implementation 00:01:20.840 |
of WebRTC called Pion and I'm Kwin I work at Daily on real-time audio and video infrastructure and on 00:01:27.020 |
an open source voice agent framework called Pipecat today we're going to talk about how to build 00:01:31.460 |
natural fast human-like voice experiences we're going to give you a crash course on low latency 00:01:38.300 |
audio and video and I hope we'll show you a couple of things you might not have thought of around voice AI 00:01:43.220 |
before if you want to build a conversational voice experience that people really love you're really 00:01:51.740 |
going to stress a lot about latency nothing else matters if your AI responds too slowly building voice 00:01:57.320 |
AI experiences is similar to other kinds of AI engineering in most ways if you've built multi-turn agents a lot 00:02:03.440 |
of that will port over to building voice agents but the big difference is latency everything in a 00:02:12.120 |
voice AI app needs to be built from the ground up for fast response times if you're talking to a 00:02:17.060 |
person around 500 milliseconds sounds natural when talking to an AI system people bring those same 00:02:23.300 |
expectations response latencies much above a second in general doom your voice agent to very low 00:02:29.120 |
completion rates and low NPS scores and hang-ups and we're talking here about voice-to-voice latency so 00:02:34.760 |
this is the time between when I the human stop talking and the time I hear the first audio byte come 00:02:42.120 |
back from the LLM let's take a look at how latency adds up in a typical voice-to-voice AI application so 00:02:50.920 |
this is a breakdown from a real voice AI running in a web browser on macOS talking over the internet to 00:02:57.760 |
a voice agent running in the cloud running on Pipecat a couple of things to note our voice latency is 00:03:05.080 |
just under a second that's good but not great we can make things a little faster but that comes with 00:03:10.600 |
trade-offs lower quality or cost second it's frustratingly easy to do even worse than this your LLM 00:03:17.440 |
might be slower other things get in the way or worst of all Bluetooth don't get me started on Bluetooth 00:03:23.200 |
but the single biggest mistake we see people making is using the wrong approach to sending and receiving 00:03:30.520 |
audio over the network it's time to talk about WebRTC and WebSockets if you're new to building voice 00:03:37.540 |
applications you probably think hey I need like a long-lived connection I'm gonna send audio and video 00:03:43.540 |
over this long-lived connection I've used WebSockets for long-lived data connections before I'm just 00:03:48.520 |
going to write some WebSockets code that's great if you're doing long-lived connections with small amounts of 00:03:54.820 |
data it doesn't work for real-time audio in fact WebSockets are almost the opposite of what you want 00:03:59.860 |
from a network engineering perspective for real-time audio and video so let's do a compare and contrast on 00:04:06.460 |
this WebSockets are great if you're trying to deliver audio and you want something really easy that can 00:04:14.560 |
target all platforms if you're trying to build a prototype on lots of different platforms WebSockets are the 00:04:18.880 |
way to go on the other hand WebRTC solves a bunch of things around giving you high quality audio 00:04:25.360 |
high bandwidth low latency but the catch is it can be a lot more complicated to implement and it 00:04:33.460 |
specializes but it could be frustrating lots of applications use both but for different things so 00:04:40.060 |
here's the TL;DR if you only remember one thing from this talk use WebSockets for those server to server use cases 00:04:46.340 |
and small amounts of structured data and places that you want a prototype use WebRTC if you're sending 00:04:51.320 |
audio and video streams over the internet from your web app or your native app that's where it excels so why is 00:04:58.280 |
it so important to use WebRTC for real-time edge to cloud audio a WebSocket is a TCP connection TCP 00:05:05.120 |
guarantees in-order delivery of network packets if you send some data that data is going to arrive exactly as 00:05:11.180 |
you send it or it's not going to arrive at all you put packets in your operating system's send queue 00:05:16.100 |
that OS queue is going to keep trying to send them until they either get acked by the other side or your 00:05:21.400 |
connection completely times out and this is in general what you want if you're doing most network 00:05:26.720 |
programming if you're making a web request for example this is perfect it's not what you want if 00:05:31.420 |
you're aiming for conversational latency remember that we're trying to hit a voice-to-voice latency of 00:05:37.640 |
under one second and ideally even better what we want to ignore is things like the occasional packet loss so 00:05:44.540 |
imagine if a packet is dropped I don't really care about what happened a second ago so WebRTC does 00:05:49.340 |
clever math and buffer management that we're going to talk about more to hide where that happens so 00:05:53.960 |
this is the first and most important thing WebRTC does for you it's all that machinery that sends 00:05:58.340 |
packets as fast as possible ignores packets that don't arrive inside that very tight latency budget 00:06:03.560 |
we're operating within you can think of it as like super fast best effort networking if this were all WebRTC 00:06:10.880 |
could do for you compared to WebSockets it would still be worth using WebRTC just for this because 00:06:16.160 |
you literally can't implement this on top of a TCP stack or on top of WebSockets again the operating 00:06:21.440 |
system is just going to try to keep sending whatever you tell it to send it's going to block everything if 00:06:25.520 |
you have any packet loss or significant sort of jitter or delay in the network and we have 00:06:30.920 |
lots and lots of real world data on this in the real world this means you will get audio glitchiness or 00:06:34.740 |
high latency or unexpected socket disconnections in 10 to 15 percent of your network connections 00:06:40.460 |
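The head-of-line blocking described above can be illustrated with a toy model. This is a minimal sketch, not code from the talk: the function names, the 20 ms packet spacing, the 30 ms one-way delay, and the 250 ms retransmission delay are all assumptions chosen for illustration. Real TCP retransmission timing varies, and real WebRTC stacks also use techniques such as jitter buffers and loss concealment that are not modeled here.

```python
# Toy model of head-of-line blocking. All numbers are illustrative
# assumptions, not measurements from the talk.

def in_order_release_times(send_times, one_way_ms, lost, retransmit_ms):
    """TCP-like delivery: a packet reaches the app only after every earlier
    packet has arrived, so one retransmission stalls everything behind it."""
    release = []
    latest = 0
    for i, sent in enumerate(send_times):
        arrival = sent + one_way_ms + (retransmit_ms if i in lost else 0)
        latest = max(latest, arrival)  # cannot be released before earlier packets
        release.append(latest)
    return release


def best_effort_release_times(send_times, one_way_ms, lost):
    """Best-effort delivery: a lost packet is skipped (None) and later
    packets are handed to the app as soon as they arrive."""
    return [None if i in lost else sent + one_way_ms
            for i, sent in enumerate(send_times)]


send_times = [0, 20, 40, 60, 80]  # one 20 ms audio frame per packet (assumed)
lost = {1}                        # the second packet is dropped once

tcp_like = in_order_release_times(send_times, one_way_ms=30, lost=lost,
                                  retransmit_ms=250)
best_effort = best_effort_release_times(send_times, one_way_ms=30, lost=lost)

for i, sent in enumerate(send_times):
    tcp_delay = tcp_like[i] - sent
    be_delay = None if best_effort[i] is None else best_effort[i] - sent
    print(f"packet {i}: in-order delay {tcp_delay} ms, best-effort delay {be_delay}")
```

In this toy run a single lost packet pushes the delay of every later frame to 220 ms or more under in-order delivery, while best-effort delivery stays at 30 ms with one frame missing.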
but WebRTC does a lot more than that if you go and you try to build the same application in WebSockets 00:06:48.860 |
you have to handle resampling you have to handle packetization and doing all that bandwidth estimation 00:06:54.380 |
networks are constantly changing and fluctuating so you can't just send one bit rate and you also 00:06:59.840 |
get standard APIs for getting the stats and observability this is all just built into WebRTC but if 00:07:05.660 |
you decide to do WebSockets you have to build it yourself so if you look at this code up on the 00:07:09.980 |
screen on the right side is an example of WebRTC sending one bi-directional stream of audio on the 00:07:15.980 |
left side is WebSockets and you want to spend more time building your application and less time 00:07:20.660 |
worrying about things like sample rates that's why you pick WebRTC and this is real code using your 00:07:26.060 |
OpenAI Realtime API where you offer both options for developers so I hope we've convinced you 00:07:32.300 |
that you should use WebRTC if you're doing edge to cloud audio and especially audio and video we love talking 00:07:39.500 |
about this stuff if you come find us later we will talk your ear off about jitter buffers and packet 00:07:43.580 |
management and bandwidth shaping and all that stuff but what we want to do now is move on and talk about 00:07:48.140 |
a whole nother category of fun stuff which is what you can actually do with WebRTC I'll start by saying 00:07:52.940 |
you can embed real-time audio in any app you write any website any iOS app any Android app lots of fun 00:07:59.180 |
embedded stuff and the network connections will just work you will get good audio on any device any 00:08:05.720 |
platform almost any real-world network connection and I bet you use WebRTC today already if 00:08:11.480 |
you use Facebook Messenger WhatsApp Zoom Discord you know any of these applications they're using WebRTC 00:08:17.240 |
but you didn't know that there's even more cool things happening with WebRTC I worked with a company 00:08:23.000 |
that was doing surgery over the internet people teleoperate vehicles in the field it's super cool 00:08:30.520 |
WebRTC is kind of the standard language of the real-time world and that's why it makes it so easy 00:08:35.800 |
that we can go build conversational intelligence on top of it all the stuff's already been solved I mean 00:08:40.920 |
in the new LLM era we know lots of people who spend hours talking to computers driving their development 00:08:46.520 |
environments with voice doing brainstorming treating the computer as a personal assistant a coach a therapist 00:08:52.120 |
a researcher I'm convinced voice is going to be the core building block of the next generation of UIs of 00:08:58.600 |
the UIs for the generative AI era we have to do a little iteration before we figure out really what 00:09:04.200 |
those UIs look and sound like but that future seems very clear to me one of the things I say to people 00:09:10.360 |
to kind of try to communicate how excited I am about building all this stuff is we all lived through the 00:09:16.440 |
last platform shift it was recent enough that the move for most computing from desktop to mobile happened 00:09:23.080 |
within our memories and so by analogy we are in late 2007 now we have the first iPhones but we haven't yet 00:09:32.120 |
invented pull to refresh what keeps me so excited and motivated is I feel like voice is like the next 00:09:39.480 |
bicycle for the mind you know today we've only been able to use computers with our eyes and our hands 00:09:44.200 |
but now I can talk to it as well think of all those situations where like you you have your voice 00:09:50.120 |
available you don't have your hands and the other great thing is you can keep all this computing power 00:09:55.640 |
remotely and access it and have small devices near you so on the small devices front we brought a friend 00:10:02.680 |
with us today uh he's a little shy sometimes so give him just a second 00:10:07.160 |
so uh squabbert can you tell people a little bit about your tech stack 00:10:23.000 |
i'm a friendly stuffed animal running a web interface locally on a raspberry pi i'm connected directly to a python 00:10:45.640 |
process running on a laptop using mlx whisper gemma 3 and a mildly buggy custom logit sampler that kwin wrote 00:10:56.360 |
i mean come on now like i had a good reason that i wanted to write a logit sampler and i i i wanted you to be able 00:11:03.560 |
to do something that even the really amazing big llms running in the cloud can't do do you know what i'm 00:11:10.280 |
you're talking about my syllable counting talent it's pretty neat even if i do say so myself 00:11:18.760 |
i mean counting syllables is like the 2025 version of how many r's are in strawberry and you know 00:11:26.040 |
squabbert why don't we demonstrate that a little bit can you make up a poem like a four-line poem 00:11:32.520 |
about computer programming and use only two syllable words 00:11:36.200 |
two okay listen logic coding people knowing system working future growing 00:11:47.480 |
one more maybe uh that was good but i've heard you do better 00:11:57.320 |
okay here's another one pixels dancing colors bright program running feels so right coding faster day and 00:12:05.560 |
you forgot about the two syllable words thing but we'll we'll we'll let it go yeah definitely the 00:12:10.520 |
first one had two syllable words i think i think we should say goodbye just in the interest of time 00:12:23.320 |
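The custom logit sampler mentioned in the demo is not shown in the talk. As a rough idea of how a sampler could enforce a syllable constraint, here is a minimal sketch under stated assumptions: it works on a hypothetical word-level vocabulary with made-up scores, and it uses a naive vowel-run heuristic to count syllables. A real LLM sampler operates on subword tokens, so this is an illustration of the masking idea only, not the approach used for squabbert.

```python
import math
import random
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels. It miscounts many English
    words (silent 'e', for example), so treat it as a placeholder."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def mask_logits(vocab, logits, syllables):
    """Set the logit of every word with the wrong syllable count to -inf."""
    return [logit if count_syllables(word) == syllables else -math.inf
            for word, logit in zip(vocab, logits)]


def sample(vocab, logits, rng):
    """Softmax sampling over the remaining (finite) logits."""
    top = max(logits)
    if top == -math.inf:
        raise ValueError("no word satisfies the constraint")
    weights = [math.exp(logit - top) for logit in logits]  # exp(-inf) == 0.0
    return rng.choices(vocab, weights=weights, k=1)[0]


# Hypothetical word-level vocabulary and model scores.
vocab = ["loop", "logic", "coding", "computer", "system", "bug"]
logits = [2.0, 1.0, 0.5, 3.0, 0.2, 1.5]

masked = mask_logits(vocab, logits, syllables=2)
rng = random.Random(0)
print(sample(vocab, masked, rng))  # always one of: logic, coding, system
```

Masking before sampling means a word with the wrong syllable count can never be chosen, however high the model scored it, which is the kind of hard constraint a prompt alone cannot guarantee.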
i think squabbert did really really well um so that that is a raspberry pi connected with a peer-to-peer 00:12:30.680 |
web rtc connection directly to my laptop over the same local area network so that was the serverless 00:12:37.560 |
web rtc connection squabbert is talking directly to the laptop but what's super cool about web rtc is you 00:12:42.600 |
have all these different choices to how you want to connect to things so you could do this local 00:12:46.680 |
connection or something like squabbert could connect to a server um running in the cloud 00:12:53.160 |
and do all the ai stuff and then the third option is you can go and connect up to something like pipecat 00:12:57.720 |
and make it multi-party so bring llms into meetings or other places like that um super awesome how like 00:13:04.440 |
you get all this flexibility to build these things the way that like matches your application 00:13:08.200 |
so we want to close with a video from actually a builder who's in the audience 00:13:13.080 |
um super excited and um what's the best part about this is that um when she built this she had 00:13:19.160 |
never written any code before this and so what makes me so excited about the future of voice 00:13:23.640 |
is we can make this easy enough that people who have really innovative inspirational ideas can go and do 00:13:28.280 |
stuff themselves hi i'm yashin i'm a mom of two bilingual kids i'm raising my kids bilingual because 00:13:39.480 |
i want them to connect with my cultural roots and that's a wish shared by many parents of bilingual 00:13:44.920 |
children but raising bilingual kids is hard it's expensive it's time consuming and it often feels like 00:13:52.600 |
a big chore on both the kids and parents i believe we can do a lot better with today's technology i really 00:14:00.600 |
believe we can make language education feel more natural and even fun for the kids here's a quick clip of 00:14:07.720 |
what i've been working on hi there buddy ready to have some fun today 00:14:26.440 |
let's say hello in mandarin we say nǐ hǎo can you say nǐ hǎo nǐ hǎo 00:14:34.920 |
okay it's too early but i'm excited about what's possible i'm not technical but with just a little 00:14:43.160 |
bit of guidance from a kind member of this community i was able to bring this first version to life i already 00:14:51.560 |
have a group of eager testers mostly parents like me who are excited to try this if this also sounds 00:14:59.560 |
exciting to you i would love to connect thanks so much for watching and let's build this future together 00:15:12.840 |
so we put the qr code up here for yashin's project which i absolutely love she's here uh if you're 00:15:18.520 |
interested if you're here in the audience and uh you're interested in multilingual stuff or building 00:15:23.400 |
these kind of things please find her it's such a great great great thing um if you're watching on 00:15:28.920 |
youtube here's the qr code sean and i are super excited about these kind of projects i mean the 00:15:34.920 |
person yashin shouted out in the video is of course sean who has done more than anybody i know 00:15:40.040 |
to make web rtc accessible to everyone the idea that we want to leave you with is if you have an idea 00:15:46.440 |
in voice ai and web rtc whether you've been a programmer for years or you're just getting started 00:15:51.720 |
like we are here to support you we're so excited and we believe that things like pipecat and live 00:15:57.160 |
peer like if we can make this easier we're going to see the next generation of really exciting innovative 00:16:01.880 |
projects come find us in the hallway or online we hang out on discord and twitter and linkedin 00:16:06.840 |
and uh here are the resources that you can scan and hopefully you find helpful so kwin wrote an 00:16:12.600 |
amazing book that i believe is in your bag um and then yeah so thank you so much we can't wait to see