
Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit



00:00:00.000 | I'm going to jump right into it. I'll be talking about voice AI's interruption problem; then after
00:00:19.740 | that I'll talk about how we currently handle interruptions and turn-taking in voice AI, what we
00:00:25.280 | can learn from the study of human conversation about how humans
00:00:29.840 | handle turn-taking, and then about some of the really neat and interesting new
00:00:33.500 | approaches out there for handling turn-taking and interruptions in voice AI
00:00:38.300 | agents. Interruptions are the biggest problem in voice AI agents
00:00:44.540 | right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts
00:00:49.340 | you, it's annoying. But when a patient is talking to a voice AI
00:00:54.800 | dental assistant and it interrupts the patient, the patient hangs up and the
00:00:59.960 | dentist stops paying the voice AI developer. This is our collective problem,
00:01:03.680 | one that we all have to solve. And the problem is that
00:01:09.740 | turn-taking is just hard. Let me first define what turn-taking is: turn-taking is
00:01:15.080 | this unspoken system we have for who controls the floor between speakers
00:01:22.040 | during a conversation. Fundamentally it's hard because turn-taking happens
00:01:27.500 | really fast in human conversation and there's no one size that fits all. There's
00:01:32.060 | a really cool study that I pulled this data from, where they looked at how
00:01:37.580 | long it took a listener to start responding after the speaker finished
00:01:41.480 | speaking, across different cultures. We can see that the Danes take a relatively
00:01:46.700 | long time to start speaking after the other speaker finishes, but
00:01:53.780 | the Japanese do it almost instantaneously. So there are differences across cultures,
00:01:58.280 | and that's part of what makes it hard. There are also differences across individuals. I'm one
00:02:01.880 | of those people that takes a long time to respond. Even before I got into voice AI,
00:02:06.020 | people would sometimes ask, "are you going to respond?" and I'd say, "yeah, yeah,
00:02:09.440 | I'm thinking about it." And even though I'm one individual, there's a lot of
00:02:15.260 | variability in how quickly I respond: if you make me angry, I'm probably going to
00:02:19.460 | respond quicker. So it's just a hard problem. On the next slide,
00:02:25.760 | for people who are not very familiar with how voice AI agent
00:02:29.840 | pipelines work, I'm going to provide a simplified overview of how we currently
00:02:33.320 | handle turn-taking and interruptions in voice AI agents. The user
00:02:38.480 | starts speaking; that's the speech input. Those audio chunks are
00:02:42.260 | passed to a speech-to-text model, and that speech-to-text model transcribes the
00:02:48.320 | audio into a transcript. The next step is something called a VAD, which determines
00:02:55.000 | whether or not the user has finished speaking; I'll go more into that in a
00:02:59.060 | moment. If the user is finished speaking, the transcript is
00:03:03.440 | passed to an LLM, and the LLM outputs its chat completion. That chat completion is
00:03:07.880 | streamed out, and that stream is passed to a text-to-speech model, where it's
00:03:11.600 | converted into audio. And that audio, which is now the audio of the voice AI
00:03:15.440 | agent, is passed back to the user.
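
To make the cascade concrete, here is a minimal sketch of that loop in Python. It's purely illustrative: stt, vad, llm, and tts are hypothetical stand-ins for whatever speech-to-text, voice activity detection, LLM, and text-to-speech components a pipeline actually uses, not any specific vendor's API.

# Illustrative sketch of a cascaded voice AI pipeline (not any specific SDK).
# stt, vad, llm, and tts are assumed components with the interfaces shown.

def run_voice_agent(mic_stream, speaker, stt, vad, llm, tts):
    transcript_buffer = []                    # text of the user's current turn

    for audio_chunk in mic_stream:            # 1. user speech arrives as audio chunks
        text = stt.transcribe(audio_chunk)    # 2. speech-to-text transcribes the audio
        if text:
            transcript_buffer.append(text)

        # 3. the VAD decides whether the user has finished their turn
        if vad.user_finished_speaking(audio_chunk):
            user_turn = " ".join(transcript_buffer)
            transcript_buffer.clear()

            # 4. the finished turn goes to the LLM, which streams a chat completion
            for text_delta in llm.stream_completion(user_turn):
                # 5. the streamed text is converted to audio by text-to-speech
                agent_audio = tts.synthesize(text_delta)
                # 6. the agent's audio is played back to the user
                speaker.play(agent_audio)
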
00:03:23.000 | Let's dig more into the voice activity detection system. It's a system with primarily two parts. First, it's a machine
00:03:28.280 | learning model, a neural network, that is detecting whether or not somebody is
00:03:32.720 | speaking, so it's just speech or not speech. It's a pretty simple machine, and
00:03:38.540 | I shouldn't call it simple, it's a really neat model, but it's
00:03:41.780 | ultimately just looking at speech or not speech. The next part
00:03:45.720 | of it is a silence algorithm, and the silence algorithm is saying: okay, if the
00:03:50.960 | person hasn't spoken for more than half a second, they're done speaking and it's
00:03:55.400 | time for the agent to start speaking. In most production voice AI
00:04:01.400 | systems we're using something like that, and that's changing; we're
00:04:05.900 | building all sorts of new interesting things that I'll cover later in the
00:04:08.360 | presentation.
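
As a rough sketch of that two-part VAD, here is the speech/not-speech model plus the silence rule in Python. The frame size, thresholds, and the speech-probability interface are assumptions for illustration, not any particular VAD's API.

# Sketch of a conventional VAD-based endpointer: a speech/not-speech model plus
# a "more than half a second of silence means the turn is over" rule.
# Frame duration, thresholds, and the speech model interface are assumptions.

FRAME_MS = 30                  # duration of each audio frame
SPEECH_THRESHOLD = 0.5         # model probability above which a frame counts as speech
END_OF_TURN_SILENCE_MS = 500   # the "half a second" silence rule

class SilenceEndpointer:
    def __init__(self, speech_model):
        self.speech_model = speech_model   # neural net: audio frame -> P(speech)
        self.silence_ms = 0

    def process_frame(self, frame) -> bool:
        """Returns True when the user is considered done speaking."""
        is_speech = self.speech_model.speech_probability(frame) > SPEECH_THRESHOLD
        if is_speech:
            self.silence_ms = 0             # any speech resets the silence timer
        else:
            self.silence_ms += FRAME_MS     # otherwise accumulate silence
        return self.silence_ms >= END_OF_TURN_SILENCE_MS
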
00:04:13.320 | In the next part of my presentation, I want to dig into what we
00:04:17.420 | can learn from linguistics and academic research about how turn-taking
00:04:22.400 | works in human conversations. One of the lines I read in a paper, a line I
00:04:26.540 | really liked, was that turn-taking in human conversation is a psycholinguistic
00:04:31.040 | puzzle: we respond in about 200 milliseconds, but the process of
00:04:35.780 | finding the words, generating speech, and articulating speech takes about 600
00:04:41.060 | milliseconds. So how can we possibly be speaking so quickly? How can the listener
00:04:46.100 | start returning an answer so quickly when it takes much longer to
00:04:49.400 | actually generate the speech? The answer is that there has to be prediction
00:04:55.780 | going on: the listener is predicting when the end of turn is going to occur, and
00:04:59.300 | they start to generate speech before that end of turn. And what
00:05:03.920 | are the primary inputs in creating that prediction? The most important one is
00:05:07.280 | the semantics: the content of what the person is saying. Other inputs into this
00:05:11.760 | prediction algorithm in our head, when we're trying to predict when the
00:05:14.720 | speaker is going to finish speaking, are the syntax (the structure of the sentence), the
00:05:19.100 | prosody (the expressiveness, the tone), and also visual cues.
00:05:28.100 | Like most things in how the human mind works, we actually don't really know; it's
00:05:33.940 | complicated. But I would say one of the generally accepted models is what I'm
00:05:40.040 | going to walk through now of how turn-taking works in the human mind. It's
00:05:44.040 | broken up into three stages. The first stage is semantic prediction. What the
00:05:48.980 | listener is doing, and you'll notice this as you speak to other people at the
00:05:52.340 | conference if you pay attention to your thought process, is
00:05:56.000 | constantly inferring the intended message of the person that's speaking to
00:06:00.620 | you. So before they finish speaking, you're figuring out: wait, what are they
00:06:04.960 | trying to say? And then you're using your prediction of what they're
00:06:08.860 | trying to say to predict when the end of
00:06:16.240 | utterance will occur. And you're not doing this just once; you're doing it
00:06:19.060 | again and again. You're constantly updating this prediction
00:06:22.420 | as the speaker keeps going. That's the first stage. In the next stage, once it
00:06:29.080 | seems like your prediction is coming true and you have a general idea of when you
00:06:33.380 | think the end of utterance will occur, as you start getting closer to it you
00:06:36.800 | start refining that endpoint prediction based on both the semantics
00:06:42.500 | and the syntax. And then as the speaker starts
00:06:45.620 | getting really close to the end of turn, the listener finalizes the prediction by
00:06:50.120 | using prosody, using information about the tone and other acoustic
00:06:54.320 | features. So it's three steps: a semantic prediction, a refinement, and a
00:07:01.440 | finalization. One of the things I want to point out is that the human mind is
00:07:06.120 | full duplex: we're both processing input and starting to generate output at the
00:07:10.100 | same time. I think that's really nicely described in this figure from the
00:07:14.640 | paper. On the x-axis we have time, where zero milliseconds is the
00:07:21.060 | end of the speaker's turn, and the different blocks represent the
00:07:25.500 | mental processes that are happening inside the mind of the listener. You
00:07:29.520 | can see that well before the end of the turn there's this whole comprehension track
00:07:34.140 | going on, where the listener is inferring the intended message of the
00:07:40.680 | speaker and making predictions about when they're going to finish speaking. And at
00:07:44.440 | the same time there's also this production, or generation, track, where the
00:07:48.880 | listener is starting to produce what they're going to say. We're going to talk
00:07:56.300 | more about full duplex models in silico, rather than in human minds, in a
00:08:01.180 | little bit. Oh, Jordan Dearsley just texted me; he texted "boo". How does he
00:08:11.680 | know to boo me? He's not even here. Okay, it's all good, we'll keep rolling. Let's go
00:08:21.620 | back and contrast this really interesting, complex process that's going on in the human mind
00:08:28.940 | with current voice AI systems. You'll see it's just so much simpler:
00:08:33.380 | it's just speech or not speech, it's looking backwards, it's not making a prediction, and it's done in
00:08:39.200 | serial; nothing's happening in parallel. So it's much simpler, and that's part of the
00:08:46.220 | problem, part of why these interruptions are happening. I'm going to talk through three types of models and the
00:08:54.860 | approaches people are using in these three types of models. The prevailing model for building voice
00:09:00.860 | AI agents is the cascading system of models, where you have what we talked about earlier: speech-to-text,
00:09:07.120 | VAD, LLM, TTS. And what we're doing there, in the new approaches to better handling
00:09:15.420 | interruptions, is augmenting the VAD with models that look at the semantics, syntax, or prosody.
00:09:21.300 | I want to jump into an example of that. I really have too much content for my allotted time, I'm
00:09:33.920 | looking down here at the clock, but maybe I can just take some time from Jordan. Let me give an example of
00:09:42.420 | one of these semantic-type models that is used to augment the VAD: I'm going to talk about our model at
00:09:48.300 | LiveKit. It's a text-based semantic model. What we're doing is taking the last four turns of the
00:09:54.240 | conversation as input, meaning the voice AI agent's turn, then the user's turn, then the
00:10:01.800 | voice AI agent's turn, and then the user's current turn. Those are the inputs into a transformer model, and
00:10:09.000 | the token we're predicting, because this is an LLM, is the end-of-utterance token.
00:10:14.880 | Based on the content, the context and the
00:10:20.760 | semantics of that input, if the end-of-utterance prediction says the end of turn hasn't
00:10:28.200 | happened yet, then we extend the silence-algorithm part of the VAD and say: don't trigger the
00:10:35.760 | end of turn, wait longer. So they work in concert. That's generally the idea of how it works.
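
As a rough sketch of that "work in concert" idea (not LiveKit's actual implementation; the model interface, probability threshold, and timeout values here are assumptions for illustration): the transformer scores how likely it is that the user's turn is complete given the recent conversation, and a low score stretches the VAD's silence timeout.

# Sketch of augmenting the VAD's silence rule with a semantic end-of-utterance
# (EOU) model. The eou_model interface, threshold, and timeouts are illustrative
# assumptions, not LiveKit's actual API.

BASE_SILENCE_MS = 500    # normal end-of-turn silence threshold
MAX_SILENCE_MS = 3000    # how long we're willing to wait on a turn that looks unfinished

def end_of_turn(silence_ms, recent_turns, eou_model):
    """recent_turns: last four turns (agent, user, agent, current user turn)."""
    # The transformer predicts the probability that an end-of-utterance token
    # should follow the user's current text, given the conversational context.
    p_done = eou_model.predict_eou_probability(recent_turns)

    if p_done >= 0.8:
        # Semantically complete; keep the normal, snappy silence threshold.
        threshold = BASE_SILENCE_MS
    else:
        # Semantically incomplete (the user is likely mid-thought);
        # stretch the threshold so the VAD doesn't trigger the turn early.
        threshold = MAX_SILENCE_MS

    return silence_ms >= threshold
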
00:10:42.360 | I'll walk through a quick demo of how this works in action. In the first part of this demo, Shane is going to
00:10:54.240 | talk to a voice agent that is just using the traditional VAD I was discussing earlier, which just looks at
00:11:03.480 | speech or not speech, and in the second half of the demo he's using our semantic end-of-utterance model.
00:11:08.760 | ...100% sure what it was that I should... please... I had to build a demo... for... a LiveKit
00:11:23.420 | turn detection demo... a turn detection demo where one of the agents... got it, what challenge... it just kept
00:11:35.300 | on interrupting me... and being interrupted by the agent constantly, that was the worst part for sure... I
00:11:41.300 | don't think it was the worst part for me... I understand, how did you overcome that challenge during the demo?
00:11:48.300 | Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit is
00:11:56.180 | offering, so let's try that one now instead. Hey, can you interview me about a time when I had to build a
00:12:07.700 | demo? Absolutely, can you share your experience with building a demo? Yeah, definitely, so I needed to build a demo but I
00:12:20.880 | wasn't a hundred percent sure what it was that I should build, and then I was thinking probably the best way to show that would be
00:12:35.260 | side-by-side...
00:12:40.480 | Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that.
00:12:48.440 | You can really see it's a night and day difference when you augment the VAD with models that look at the
00:12:52.940 | semantics and syntax and prosody. So, a good segue to my next slide: there's another approach
00:13:01.120 | people are taking to augmenting the VAD, where they're not just taking the semantic, text-based input,
00:13:07.340 | but also looking at the audio signal and trying to infer things from the acoustic features of the
00:13:12.740 | dialogue.
00:13:14.300 | The basic idea is that the input is audio tokens and the output is the probability that the user is finished speaking.
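
A minimal sketch of that audio-in, probability-out shape, kept generic and hypothetical (this is not the Smart Turn, AssemblyAI, or Kyutai interface):

# Generic shape of an acoustic end-of-turn model: audio in, probability that the
# user has finished speaking out. All names here are hypothetical placeholders,
# not any particular vendor's API.

class AcousticTurnDetector:
    def __init__(self, model):
        self.model = model  # e.g. a transformer over audio tokens / acoustic features

    def end_of_turn_probability(self, audio_window) -> float:
        """Returns P(user finished speaking) for the trailing window of audio."""
        audio_tokens = self.model.tokenize(audio_window)
        return self.model.predict_finished(audio_tokens)

# A pipeline might declare the end of turn only when this probability and the
# silence-based VAD agree, for example:
#   if detector.end_of_turn_probability(window) > 0.7 and silence_ms > 300:
#       start_agent_response()
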
00:13:20.340 | Kwindla and the Daily team have also built an open-weight Smart Turn model, which is this neat combination of a model that is both a transformer and looking at acoustic
00:13:31.400 | characteristics. And then one of the new things that has just emerged: AssemblyAI dropped their new streaming speech-to-text service earlier this week,
00:13:43.560 | and their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker is finished speaking.
00:13:54.900 | So it's one model that's doing both of those things at the same time, and it's looking at both the acoustic features and the semantic features.
00:14:03.240 | Kyutai also has one that they recently released; it's pretty neat. One thing I want to note, though, is that if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context:
00:14:15.400 | it's only seeing what the user is saying, not what the agent is saying, so it doesn't quite have the full picture on the context.
00:14:21.760 | But all of these approaches work remarkably well and are a major step forward from the
00:14:29.600 | more traditional VADs, and they're definitely something that, if you're building a voice AI agent after
00:14:34.280 | this conference, you should go implement. They're pretty easy to implement on the different platforms too.
00:14:38.880 | Okay. We often say in voice AI that speech models are going to save us,
00:14:52.160 | that speech-to-speech models, audio in and audio out, are going to save us.
00:14:55.620 | But actually, if you look at how these models work, like OpenAI's Realtime API, they're still using a
00:15:02.980 | VAD on the internals. So they're still just looking at speech or not speech, or you can opt to turn on their semantic
00:15:11.180 | VAD, as they call it, which is kind of a paradox, it's not the best term, but that turns on a semantic model that augments the VAD.
00:15:18.580 | So, to answer the title of my talk, why ChatGPT Advanced Voice Mode keeps interrupting you:
00:15:24.020 | it's because it thinks you're done speaking based on how long it's been since you last said a word,
00:15:28.100 | or based on what you've said previously, and it's just not quite cutting it
00:15:34.420 | when those interruptions happen. It's a problem that is not totally solved; I want to
00:15:40.860 | bring that up too, that this is an ongoing problem with all the different approaches.
00:15:46.380 | Nothing has perfected it yet, ours at LiveKit included. And although we power the audio transport
00:15:53.180 | layer for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.
00:15:59.340 | The next topic I want to cover, and I'm running out of time here, is full duplex
00:16:07.180 | models. These are really neat. A full duplex model is more like a human mind in that it's processing
00:16:11.500 | input and generating speech at the same time. As far as I know there aren't really any commercial
00:16:17.420 | applications of these yet, but fundamentally they're intuitive talkers: they're
00:16:22.380 | trained on the raw audio data. The analogy I like to use is computer vision.
00:16:27.180 | In the early days of computer vision we were hand-writing algorithms to try to recognize a stop sign based
00:16:32.220 | on its color, the number of sides it has, and so on, and it just didn't work very well. But when we started
00:16:37.900 | giving the raw image data to the neural network and letting the neural network figure it out, all of a sudden
00:16:42.460 | it just started working. And I actually think that what we learned from computer vision
00:16:48.300 | really helped us emerge from the AI winter; it was a major seeding process for where we are
00:16:53.580 | now with AI. It's a similar story with full duplex models, in that we're handing them the raw audio
00:16:59.180 | data and just letting them figure out how turn-taking works rather than trying to hand-write all
00:17:03.180 | the rules. But the downside of these models is that they're really optimized for being really
00:17:08.620 | good at turn-taking, and they're kind of dumb LLMs: they're small models, they're not trained on a lot of
00:17:13.180 | data, and they can't do instruction following very well. Just to give you a more specific sense
00:17:18.540 | of how these models work, let's talk about the Moshi model. What really made it concrete for
00:17:23.340 | me is this idea that it's always listening to input and it's always generating
00:17:29.820 | output, and even when it's not its turn to speak it's emitting "natural silence". It's basically
00:17:34.700 | emitting silence that you can't hear, but it's still always emitting something, so it's always doing
00:17:39.740 | both, just like a human is. SyncLLM, which is Meta AI's full duplex experimental mode that you can
00:17:47.740 | access inside their app, is also a full duplex model. Something
00:17:54.540 | neat I want to bring up about SyncLLM is that, in the internals of that model,
00:17:59.580 | they're actually forecasting what the user is saying about five tokens ahead, or 200 milliseconds ahead, which is
00:18:05.260 | more closely like what humans are doing, except we're forecasting over a much longer time frame.
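
To make the full duplex idea concrete, here is a highly simplified sketch of that always-listening, always-emitting loop. It's a conceptual illustration only, not Moshi's or SyncLLM's actual architecture; the model methods and the five-token forecast horizon are stand-ins based on the description above.

# Conceptual sketch of a full duplex speech model's inner loop: on every step it
# consumes one frame of user audio AND emits one frame of its own audio, which is
# inaudible "natural silence" whenever it decides it isn't its turn to speak.
# This is an illustration, not Moshi's or SyncLLM's actual implementation.

def full_duplex_loop(model, mic_frames, speaker):
    state = model.initial_state()
    for user_frame in mic_frames:                 # always listening...
        state = model.update(state, user_frame)

        # ...and always generating: the next output frame may be audible speech
        # or a "natural silence" frame that the user simply doesn't hear.
        out_frame = model.next_output_frame(state)
        speaker.play(out_frame)

        # Some full duplex models (as described above for SyncLLM) also keep a
        # short forecast of the user's next few tokens to stay ahead of the turn.
        state = model.forecast_user(state, horizon_tokens=5)
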
00:18:11.900 | And then lastly, my predictions for the future of how we'll solve this problem. I think full duplex
00:18:18.540 | models are neat, but I don't think they're going to solve the problem. For real
00:18:23.820 | production, commercial use cases of voice AI we need more control, including more control over how it
00:18:29.020 | says things like brand names. Instead, what I think is going to happen is that we're going to get smarter
00:18:33.980 | and smarter VAD augmentations and faster and faster models in the cascade pipeline, and we're just going to
00:18:39.020 | have more budget to work with to do a good job with this sort of thing. The reason I think that's
00:18:45.340 | true is that computers don't do math the same way humans do; they don't have the same conceptual way of
00:18:49.980 | thinking about it, and LLMs think differently than us. Similarly, I wouldn't expect voice AI to use the
00:18:55.820 | same mechanisms as the human mind to generate speech and to talk. Thank you all for your attention;
00:19:02.860 | this was fun, I really appreciate it. We do have some time, so I don't know if you want to take
00:19:10.220 | Q&A? I would love to, we could do that, and I could start with the first question.
00:19:16.860 | So in the demo you showed, there wasn't any response at the end, right? I cut off the demo; it's
00:19:25.420 | actually a two-minute demo, and I only have 18 minutes to speak, so I truncated it on both sides.
00:19:29.340 | Because I was like, okay, maybe you just turned everything off and it was an impressive demo:
00:19:34.620 | no interruptions, no speaking. Yeah. Do you have the end of the demo, do you want to show it, or...?
00:19:40.700 | No, no worries, it's just more of the same idea. What you can see is that Shane was
00:19:46.620 | taking his time talking and really pausing and thinking, it wasn't interrupting him, and then
00:19:51.420 | it eventually would find his end of turn based on the context. Cool, can we find it on your Twitter or...?
00:19:57.340 | Yeah, it's on our LiveKit Twitter. Awesome, so we can look that up on the LiveKit Twitter.
00:20:02.860 | Awesome, yeah, we can take some questions. I saw you had one. Hi, how important are visual cues for turn
00:20:12.380 | detection in the human context, and is there any development to kind of replicate that in the
00:20:20.860 | voice AI context as well? Yeah, it's a really neat question: how important are visual cues, and
00:20:26.140 | are people working on integrating that into the turn-taking intelligence for avatars and real-time
00:20:35.100 | experiences? Visual cues, despite the fact that we are visual animals,
00:20:42.380 | and visual is the most visceral input for us, are actually pretty low down the stack of
00:20:49.500 | predictors for when the end of turn will be, because it really is semantics. That's one of
00:20:54.620 | the main messages I want to convey to people from this talk: the content of what people
00:20:59.020 | are saying is the main thing we're using to predict when they're going to finish speaking, and
00:21:03.740 | these visual cues and the other inputs are ancillary to it. I'm sure somebody's working on
00:21:09.500 | building something really cool that's multimodal and looking at visual cues to
00:21:13.980 | infer the end of turn; I just haven't seen it yet and can't keep up with all the AI stuff on the
00:21:19.260 | internet. Yes: what is the average cost for a typical voice AI call, and what is the
00:21:34.540 | effect when you try to keep regenerating the response?
00:21:37.580 | So the question is: what's the average cost for a voice AI call, and what is the cost when you keep
00:21:47.900 | trying to regenerate the response? I would first say, on the reference
00:21:56.940 | to what it costs to keep trying to regenerate the response, the way I want to answer that is
00:22:03.500 | that the thing that's most expensive in the pipeline tends to be the text-to-speech. There are
00:22:07.820 | all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times
00:22:11.820 | within a turn it's not all that costly, and those sorts of things. There are some really neat
00:22:17.980 | calculators online. Because I'm personally not a voice AI agent builder and those unit economics don't
00:22:23.900 | directly affect me, I don't have the numbers off the top of my head, but there are some really nice
00:22:28.700 | calculators; it's going to depend on how long the conversation is and that sort of thing.
00:22:31.900 | Yeah, thanks for the demo, that was great. The question is about your new model that you just
00:22:44.780 | showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now, if I were to build a voice bot on LiveKit? And three:
00:23:01.180 | during your development of that model (the demo is great), do you also do some kind of benchmarking with users to
00:23:09.340 | see if, you know, this is like 50% better or something like that? Yes, so the first question is about why
00:23:17.100 | OpenAI isn't using our end-of-utterance model. I don't know why they're not; I think that's maybe above
00:23:23.500 | my pay grade at this company, I just joined four weeks ago. The second question is whether
00:23:31.020 | our end-of-utterance model is available for use. It's
00:23:39.660 | really easy to follow our quickstart on our website and build a voice AI agent that you can talk to, and
00:23:45.340 | it's just one more line in the pipeline that you build: you turn it on with one more
00:23:51.020 | line in there and you get to use our end-of-utterance model. It's open weight, you don't have to pay
00:23:56.060 | for it, it's just baked in, and our docs show you how; I think it's in there by default. And
00:24:02.940 | the third question... remind me of the third question, I'm sorry.
00:24:06.460 | Did you do benchmarking? Ah yes, benchmarking. We have benchmarks where we have our test data set, and
00:24:14.620 | the numbers of course look great. But I was on a long call with our machine
00:24:22.540 | learning team this morning where we spent a lot of time just talking about how we get a good
00:24:26.700 | data set for benchmarking that, and it's just really a tough problem. I feel like
00:24:32.940 | the industry as a whole doesn't have a good benchmark around turn-taking, and that's
00:24:38.380 | something that I'm sure will eventually emerge. Okay, we'll do one last question, I think. There
00:24:45.900 | was... yes, you in the back. Not great to be sitting in the back if you want to ask questions, but...
00:24:51.740 | Thank you, Tom, for the demo and the presentation. I've got a question related to back channels:
00:24:59.260 | how do you tackle the back-channel challenge in the turn-taking detection problem? First,
00:25:05.020 | for a natural conversational AI, the back channel is something that cannot be ignored, and sometimes it
00:25:12.780 | causes trouble for the voice agent in detecting where the endpoint is. And second,
00:25:19.660 | a typical back channel like "yeah" or "yes" can occur as a back channel, or it can be the start of the
00:25:28.460 | agent's turn, where the agent should be responding in the following period. So how the back channel is handled
00:25:35.420 | should be kind of important in this field. So the question was about not
00:25:45.100 | the case where the AI is interrupting the human, but the case where the human is accidentally
00:25:50.860 | interrupting the AI. I didn't really cover that in my talk; I was mostly focusing on the AI
00:25:56.620 | interrupting the human. Our approach is simple: we're just using Silero,
00:26:04.940 | the normal VAD approach, where if the person has been speaking for more than X milliseconds we assume
00:26:11.260 | it's not a back channel and that they're actually trying to interrupt the voice AI. But one of the things
00:26:15.260 | we want to build is another machine learning model that can recognize the difference between
00:26:20.220 | a back channel and someone actually trying to interrupt the voice AI.
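
Roughly, that simple rule looks something like this (a sketch; the duration threshold is an assumption, not a number from the talk):

# Sketch of the simple heuristic described above: while the agent is speaking,
# short bursts of user speech ("yeah", "uh-huh") are treated as back channels,
# while sustained speech is treated as a real interruption.
# The threshold value is an illustrative assumption.

BACKCHANNEL_MAX_MS = 700   # user speech shorter than this is assumed to be a back channel

def should_treat_as_interruption(agent_is_speaking, user_speech_ms):
    if not agent_is_speaking:
        return False                              # nothing to interrupt
    return user_speech_ms > BACKCHANNEL_MAX_MS    # longer speech = real interruption
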
00:26:28.220 | One quick note on the full duplex models: the Meta AI one can natively back-channel, because it learned from the raw
00:26:34.460 | audio data, so when you're talking to it, it'll go "uh-huh", which is just so neat. Yeah, it's just a
00:26:42.220 | tough problem, the back-channeling thing. Awesome. Yeah, if you have more questions you
00:26:49.660 | can find Tom. Please give another warm round of applause for Tom. Thank you all.