Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

00:00:00.000 | all right Squabbert you ready to get packed up I don't know I'm pretty nervous oh relax you got

00:00:21.000 | nothing to worry about but this is like the worst idea ever for a live demo a live unscripted

00:00:27.320 | conversation with a non deterministic LLM on conference Wi-Fi come on your prompt is great

00:00:33.620 | and I mean it's gonna be on stage audio in a room full of echoes sometimes my text-to-speech even

00:00:39.080 | mispronounces my own name why did you even name me Squabbert what if I say Squiggly or something again

00:00:45.680 | okay take a breath well you don't really do that let's just take it one step at a time just start

00:00:52.280 | with the intro I guess you're right okay here I go hi everybody I'm Squabbert here to take us on a

00:01:00.380 | whirlwind tour of the wonderful world of WebRTC please welcome Sean and Kwin hey I'm Sean I work

00:01:09.440 | on WebRTC at OpenAI some of the things you might be familiar with are the Realtime API or 1-800-ChatGPT

00:01:15.200 | you can call it right from your phone before I worked at OpenAI I worked on the Go implementation

00:01:20.840 | of WebRTC called Pion and I'm Kwin I work at Daily on real-time audio and video infrastructure and on

00:01:27.020 | an open source voice agent framework called Pipecat today we're going to talk about how to build

00:01:31.460 | natural fast human-like voice experiences we're going to give you a crash course on low latency

00:01:38.300 | audio and video and I hope we'll show you a couple of things you might not have thought of around voice AI

00:01:43.220 | before if you want to build a conversational voice experience that people really love you're really

00:01:51.740 | going to stress a lot about latency nothing else matters if your AI responds too slowly building voice

00:01:57.320 | AI experiences is similar to other kinds of AI engineering in most ways if you've built multi-turn agents a lot

00:02:03.440 | of that will port over to building voice agents but the big difference is latency everything in a

00:02:12.120 | voice AI app needs to be built from the ground up for fast response times if you're talking to a

00:02:17.060 | person around 500 milliseconds sounds natural when talking to an AI system people bring those same

00:02:23.300 | expectations response latencies much above a second in general doom your voice agent to very low

00:02:29.120 | completion rates and low NPS scores and hang-ups and we're talking here about voice-to-voice latency so

00:02:34.760 | this is the time between when I the human stop talking and the time I hear the first audio byte come

00:02:42.120 | back from the LLM let's take a look at how latency adds up in a typical voice-to-voice AI application so

00:02:50.920 | this is a breakdown from a real voice AI running in a web browser on macOS talking over the internet to

00:02:57.760 | a voice agent running in the cloud running on Pipecat a couple of things to note first our voice-to-voice latency is

00:03:05.080 | just under a second that's good but not great we can make things a little faster but that comes with

00:03:10.600 | trade-offs lower quality or higher cost second it's frustratingly easy to do even worse than this your LLM

00:03:17.440 | might be slower other things get in the way or worst of all Bluetooth don't get me started on Bluetooth
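
To make the latency arithmetic concrete, here is a minimal sketch that sums per-stage delays into a voice-to-voice total. The stage names and millisecond values are illustrative assumptions, not the figures from the slide; only the "just under a second" total and the roughly one-second ceiling come from the talk. The function names are hypothetical.

```python
# Hypothetical per-stage latencies in milliseconds. These are placeholder
# values chosen for illustration, NOT the breakdown shown in the talk.
STAGES_MS = {
    "mic capture and encoding": 40,
    "network uplink": 50,
    "speech-to-text and end-of-turn detection": 300,
    "LLM time to first token": 350,
    "text-to-speech time to first byte": 120,
    "network downlink": 50,
    "jitter buffer and playback": 60,
}


def voice_to_voice_ms(stages):
    """Total delay from the end of user speech to the first audio heard back."""
    return sum(stages.values())


def over_budget(stages, budget_ms=1000):
    """True when the summed stages exceed the latency budget."""
    return voice_to_voice_ms(stages) > budget_ms


if __name__ == "__main__":
    for name, ms in STAGES_MS.items():
        print(f"{name:<45}{ms:>5} ms")
    print(f"{'total':<45}{voice_to_voice_ms(STAGES_MS):>5} ms")

    # Adding one more slow hop (an assumed 200 ms for a wireless headset)
    # is enough to push this example past one second.
    with_headset = dict(STAGES_MS, **{"wireless headset": 200})
    print("over budget with headset:", over_budget(with_headset))
```

Because every stage adds to the same total, a single slow component is enough to move a conversation from acceptable to frustrating.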

00:03:23.200 | but the single biggest mistake we see people making is using the wrong approach to sending and receiving

00:03:30.520 | audio over the network it's time to talk about WebRTC and WebSockets if you're new to building voice

00:03:37.540 | applications you probably think hey I need a long-lived connection I'm gonna send audio and video

00:03:43.540 | over this long-lived connection I've used WebSockets for long-lived data connections before I'm just

00:03:48.520 | going to write some WebSockets code that's great if you're sending small amounts of data over a long-lived

00:03:54.820 | connection it doesn't work for real-time audio in fact WebSockets are almost the opposite of what you want

00:03:59.860 | from a network engineering perspective for real-time audio and video so let's do a compare and contrast on

00:04:06.460 | this WebSockets are great if you're trying to deliver audio and you want something really easy that can

00:04:14.560 | target all platforms if you're trying to build a prototype for lots of different platforms WebSockets are the

00:04:18.880 | way to go on the other hand WebRTC solves a bunch of things around giving you high quality audio

00:04:25.360 | high bandwidth low latency but the catch is it can be a lot more complicated to implement and it

00:04:33.460 | specializes but it could be frustrating lots of applications use both but for different things so

00:04:40.060 | here's the TL;DR if you only remember one thing from this talk use WebSockets for those server to server use cases

00:04:46.340 | and small amounts of structured data and places where you want a prototype use WebRTC if you're sending

00:04:51.320 | audio and video streams over the internet from your web app or your native app that's where it excels so why is

00:04:58.280 | it so important to use WebRTC for real-time edge to cloud audio a WebSocket is a TCP connection TCP

00:05:05.120 | guarantees in-order delivery of network packets if you send some data that data is going to arrive exactly as

00:05:11.180 | you send it or it's not going to arrive at all you put packets in your operating system's send queue

00:05:16.100 | that OS queue is going to keep trying to send them until they either get ACKed by the other side or your

00:05:21.400 | connection completely times out and this is in general what you want if you're doing most network

00:05:26.720 | programming if you're making a web request for example this is perfect it's not what you want if

00:05:31.420 | you're aiming for conversational latency remember that we're trying to hit a voice-to-voice latency of

00:05:37.640 | under one second and ideally even better what we want to ignore is things like the occasional packet loss so

00:05:44.540 | imagine if a packet is dropped I don't really care about what happened a second ago so WebRTC does

00:05:49.340 | clever math and buffer management that we're going to talk about more to hide where that happens so

00:05:53.960 | this is the first and most important thing WebRTC does for you it's all that machinery that sends

00:05:58.340 | packets as fast as possible and ignores packets that don't arrive inside that very tight latency budget

00:06:03.560 | we're operating within you can even think of it as like super fast best effort networking if this were all WebRTC

00:06:10.880 | could do for you compared to WebSockets it would still be worth using WebRTC just for this because

00:06:16.160 | you literally can't implement this on top of a TCP stack or on top of WebSockets again the operating

00:06:21.440 | system is just going to try to keep sending whatever you tell it to send it's going to block everything if

00:06:25.520 | you have any packet loss or significant sort of jitter or delay in the network and we have

00:06:30.920 | lots and lots of real-world data on this in the real world this means you will get audio glitchiness or

00:06:34.740 | high latency or unexpected socket disconnections in 10 to 15 percent of your network connections
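
The difference between ordered, reliable delivery and deadline-based best-effort delivery can be sketched with a toy simulation. This is a simplified model under stated assumptions, not how TCP or WebRTC is actually implemented: the 20 ms frame size, 30 ms one-way delay, 400 ms retransmission penalty, and 100 ms playout budget are all hypothetical values, and the function names are made up for illustration.

```python
FRAME_MS = 20  # assumed audio frame duration; packet n is sent at n * FRAME_MS


def in_order_release_times(arrivals_ms):
    """Ordered, reliable delivery: a packet reaches the application only
    after every earlier packet has, so one late packet delays the rest."""
    released, latest = [], 0
    for arrival in arrivals_ms:
        latest = max(latest, arrival)
        released.append(latest)
    return released


def deadline_playout(arrivals_ms, budget_ms):
    """Best-effort delivery: play a packet only if it arrives within its
    latency budget, otherwise skip it and move on."""
    return [arrival <= seq * FRAME_MS + budget_ms
            for seq, arrival in enumerate(arrivals_ms)]


# Hypothetical arrivals: 30 ms one-way delay, except packet 2, which is lost
# once and shows up 400 ms late after a retransmission.
ARRIVALS_MS = [30, 50, 470, 90, 110, 130]

if __name__ == "__main__":
    print(in_order_release_times(ARRIVALS_MS))  # [30, 50, 470, 470, 470, 470]
    print(deadline_playout(ARRIVALS_MS, 100))   # [True, True, False, True, True, True]
```

In the ordered model the one late packet holds back every packet behind it, which is the blocking behavior described above. In the deadline model only that single 20 ms frame is skipped, and the receiver is left to conceal the gap.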

00:06:40.460 | but WebRTC does a lot more than that if you go and you try to build the same application in WebSockets

00:06:48.860 | you have to handle resampling you have to handle packetization and doing all that bandwidth estimation

00:06:54.380 | networks are constantly changing and fluctuating so you can't just send one bit rate and you also

00:06:59.840 | get standard APIs for getting the stats and observability this is all just built into WebRTC but if

00:07:05.660 | you decide to do WebSockets you have to build it yourself so if you look at this code up on the

00:07:09.980 | screen on the right side is an example of WebRTC sending one bi-directional stream of audio on the

00:07:15.980 | left side is WebSockets and you want to spend more time building your application and less time

00:07:20.660 | worrying about things like sample rates that's why you pick WebRTC and this is real code using your

00:07:26.060 | OpenAI Realtime API where you offer both options for developers so I hope we've convinced you

00:07:32.300 | that you should use WebRTC if you're doing edge to cloud audio and especially audio and video we love talking

00:07:39.500 | about this stuff if you come find us later we will talk your ear off about jitter buffers and packet

00:07:43.580 | management and bandwidth shaping and all that stuff but what we want to do now is move on and talk about

00:07:48.140 | a whole other category of fun stuff which is what you can actually do with WebRTC I'll start by saying

00:07:52.940 | you can embed real-time audio in any app you write any website any iOS app any Android app lots of fun

00:07:59.180 | embedded stuff and the network connections will just work you will get good audio on any device any

00:08:05.720 | platform almost any real-world network connection and I bet you use WebRTC today already if

00:08:11.480 | you used Facebook Messenger WhatsApp Zoom Discord you know any of these applications they're using WebRTC

00:08:17.240 | but you didn't know that there's even more cool things happening with WebRTC I worked with a company

00:08:23.000 | that was doing surgery over the internet people teleoperate vehicles in the field it's super cool

00:08:30.520 | WebRTC is kind of the standard language of the real-time world and that's why it makes it so easy

00:08:35.800 | that we can go build conversational intelligence on top of it all the stuff's already been solved I mean

00:08:40.920 | in the new LLM era we know lots of people who spend hours talking to computers driving their development

00:08:46.520 | environments with voice doing brainstorming treating the computer as a personal assistant a coach a therapist

00:08:52.120 | a researcher I'm convinced voice is going to be the core building block of the next generation of UIs of

00:08:58.600 | the UIs for the generative AI era we have to do a little iteration before we figure out really what

00:09:04.200 | those UIs look and sound like but that future seems very clear to me one of the things I say to people

00:09:10.360 | to kind of try to communicate how excited I am about building all this stuff is we all lived through the

00:09:16.440 | last platform shift it was recent enough that the move for most computing from desktop to mobile happened

00:09:23.080 | within our memories and so by analogy we are in late 2007 now we have the first iPhones but we haven't yet

00:09:32.120 | invented pull to refresh what keeps me so excited and motivated is I feel like voice is like the next

00:09:39.480 | bicycle for the mind you know until today we've only been able to use computers with our eyes and our hands

00:09:44.200 | but now I can talk to it as well think of all those situations where like you have your voice

00:09:50.120 | available you don't have your hands and the other great thing is you can keep all this computing power

00:09:55.640 | remotely and access it and have small devices near you so on the small devices front we brought a friend

00:10:02.680 | with us today uh he's a little shy sometimes so give him just a second

00:10:07.160 | so uh Squabbert can you tell people a little bit about your tech stack

00:10:23.000 | I'm a friendly stuffed animal running a web interface locally on a Raspberry Pi I'm connected directly to a Python

00:10:45.640 | process running on a laptop using MLX Whisper Gemma 3 and a mildly buggy custom logit sampler that Kwin wrote

00:10:56.360 | I mean come on now like I had a good reason that I wanted to write a logit sampler and I wanted you to be able

00:11:03.560 | to do something that even the really amazing big LLMs running in the cloud can't do do you know what I'm

00:11:09.800 | talking about
00:11:10.280 | you're talking about my syllable counting talent it's pretty neat even if I do say so myself

00:11:18.760 | I mean counting syllables is like the 2025 version of how many r's are in strawberry and you know

00:11:26.040 | Squabbert why don't we demonstrate that a little bit can you make up a poem like a four-line poem

00:11:32.520 | about computer programming and use only two-syllable words

00:11:36.200 | two okay listen logic coding people knowing system working future growing

00:11:47.480 | one more maybe uh that was good but I've heard you do better

00:11:57.320 | okay here's another one pixels dancing colors bright program running feels so right coding faster day and

00:12:03.720 | night making magic pure delight

00:12:05.560 | you forgot about the two-syllable words thing but we'll let it go yeah definitely the

00:12:10.520 | first one had two-syllable words I think we should say goodbye just in the interest of time

00:12:14.200 | Squabbert but thank you so much

00:12:23.320 | I think Squabbert did really really well um so that is a Raspberry Pi connected with a peer-to-peer

00:12:30.680 | WebRTC connection directly to my laptop over the same local area network so that was a serverless

00:12:37.560 | WebRTC connection Squabbert is talking directly to the laptop but what's super cool about WebRTC is you

00:12:42.600 | have all these different choices for how you want to connect to things so you could do this local

00:12:46.680 | connection or something like Squabbert could connect to a server um running in the cloud

00:12:53.160 | and do all the AI stuff and then the third option is you can go and connect up to something like Pipecat

00:12:57.720 | and make it multi-party so bring LLMs into meetings or other places like that um super awesome how like

00:13:04.440 | you get all this flexibility to build these things the way that like matches your application

00:13:08.200 | so we want to close with a video from a builder who's actually in the audience

00:13:13.080 | um super excited and um the best part about this is that um when she built this she had

00:13:19.160 | never written any code before and so what makes me so excited about the future of voice

00:13:23.640 | is we can make this easy enough that people who have really innovative inspirational ideas can go and do

00:13:28.280 | stuff themselves hi I'm Yashin I'm a mom of two bilingual kids I'm raising my kids bilingual because

00:13:39.480 | I want them to connect with my cultural roots and that's always shared by many parents of bilingual

00:13:44.920 | children but raising bilingual kids is hard it's expensive it's time consuming and it often feels like

00:13:52.600 | a big chore for both the kids and parents I believe we can do a lot better with today's technology I really

00:14:00.600 | believe we can make language education feel more natural and even fun for the kids here's a quick clip of

00:14:07.720 | what I've been working on hi there buddy ready to have some fun today

00:14:26.440 | let's say hello in mandarin we say nǐ hǎo can you say nǐ hǎo nǐ hǎo

00:14:34.920 | okay it's still early but I'm excited about what's possible I'm not technical but with just a little

00:14:43.160 | bit of guidance from a kind member of this community I was able to bring this first version to life I already

00:14:51.560 | have a group of eager testers mostly parents like me who are excited to try this if this also sounds

00:14:59.560 | exciting to you I would love to connect thanks so much for watching and let's build this future together

00:15:12.840 | so we put the QR code up here for Yashin's project which I absolutely love she's here uh if you're

00:15:18.520 | interested if you're here in the audience and uh you're interested in multilingual stuff or building

00:15:23.400 | these kinds of things please find her it's such a great great great thing um if you're watching on

00:15:28.920 | YouTube here's the QR code Sean and I are super excited about these kinds of projects I mean the

00:15:34.920 | person Yashin shouted out in the video is of course Sean who has done more than anybody I know

00:15:40.040 | to make WebRTC accessible to everyone the idea that we want to leave you with is if you have an idea

00:15:46.440 | in voice AI and WebRTC whether you've been a programmer for years or you're just getting started

00:15:51.720 | like we are here to support you we're so excited and we believe that things like Pipecat and live

00:15:57.160 | peer like if we can make this easier we're going to see the next generation of really exciting innovative

00:16:01.880 | projects come find us in the hallway or online we hang out on Discord and Twitter and LinkedIn

00:16:06.840 | and uh here are the resources that you can scan and hopefully you find helpful so Kwin wrote an

00:16:12.600 | amazing book that I believe is in your bag um and then yeah so thank you so much we can't wait to see

00:16:19.000 | what you build
