I'm going to jump right into it. I'll be talking about voice AI's interruption problem, then how we currently handle interruptions and turn-taking in voice AI, then what we can learn from the study of human conversation about how humans handle turn-taking, and finally some of the really neat new approaches for handling turn-taking and interruptions in voice AI agents.

Interruptions are the biggest problem in voice AI agents right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts you, it's annoying. But when a patient is talking to a voice AI dental assistant and it interrupts the patient, the patient hangs up and the dentist stops paying the voice AI developer. This is our collective problem, one that we all have to solve, and the core issue is that turn-taking is just hard.

Let me first define turn-taking: it's the unspoken system we have for deciding who controls the floor between speakers during a conversation. It's fundamentally hard because turn-taking happens really fast in human conversation, and there's no one size fits all. There's a really cool study I pulled this data from that looked at how long it took a listener to start responding after the speaker finished speaking, across different cultures. The Danes take a relatively long time to start speaking after the other speaker finishes, while the Japanese do it almost instantaneously. So there are differences across cultures, and that's part of what makes it hard. There are also differences across individuals. I'm one of those people who takes a long time to respond; even before I got into voice AI, people would sometimes ask, "Are you going to respond?" and I'd say, "Yeah, yeah, thinking about it." And even within one individual there's a lot of variability in how quickly I respond: if you make me angry, I'm probably going to respond quicker. So it's just a hard problem.

Next, for people who aren't very familiar with how voice AI agent pipelines work, here's a simplified overview of how we currently handle turn-taking and interruptions in voice AI agents. The user starts speaking; that's the speech input. Those audio chunks are passed to a speech-to-text model, which transcribes the audio into a transcript. The next step is something called a VAD, which determines whether or not the user has finished speaking (I'll go more into that in a moment). If the user is finished speaking, the transcript is passed to an LLM, and the LLM outputs its chat completion. That completion is streamed out to a text-to-speech model, where it's converted into audio, and that audio, which is now the voice of the AI agent, is passed back to the user.

Let's dig more into the voice activity detection system. It has two primary parts. The first is a machine learning model, a neural network, that detects whether or not somebody is speaking: speech or not-speech. I shouldn't call it simple, it's a really neat model, but ultimately it's just looking at speech versus not-speech. The second part is a silence algorithm, which says: if the person hasn't spoken for more than half a second, they're done speaking and it's time for the agent to start speaking.
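To make that concrete, here's a minimal sketch of the cascade and the silence rule; every component name here is a stand-in rather than any particular vendor's API, and the numbers are illustrative:

```python
# Minimal sketch of the cascaded pipeline and the VAD silence rule described
# above. All component names (stt, vad, llm, tts, speaker) are stand-ins, not
# any particular vendor's API; the thresholds are illustrative.

SILENCE_TIMEOUT_S = 0.5   # "hasn't spoken for more than half a second"
FRAME_S = 0.03            # duration of one audio frame

def run_turn(mic_frames, stt, vad, llm, tts, speaker):
    """One user turn through the cascade: STT -> VAD endpointing -> LLM -> TTS."""
    transcript = ""
    trailing_silence = 0.0
    for frame in mic_frames:
        transcript += stt.transcribe_chunk(frame)   # streaming speech-to-text
        if vad.is_speech(frame):                    # speech / not-speech classifier
            trailing_silence = 0.0                  # the user is (still) talking
        else:
            trailing_silence += FRAME_S             # accumulate trailing silence
        if trailing_silence > SILENCE_TIMEOUT_S:    # silence rule: user is "done"
            break
    reply_text = llm.complete(transcript)           # chat completion on the transcript
    for audio_chunk in tts.stream(reply_text):      # text-to-speech, streamed out
        speaker.play(audio_chunk)                   # agent audio back to the user
```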
That's roughly what most production voice AI systems are using today, though that's changing; we're building all sorts of new and interesting things that I'll cover later in the presentation.

In the next part of my presentation I want to dig into what we can learn from linguistics and academic research about how turn-taking works in human conversations. One line I read in a paper, which I really liked, is that turn-taking in human conversation is a psycholinguistic puzzle: we respond in about 200 milliseconds, but the process of finding the words, generating speech, and articulating it takes about 600 milliseconds. So how can the listener possibly start returning an answer so quickly, when it takes much longer to actually generate the speech? The answer is that there has to be prediction going on. The listener is predicting when the end of turn is going to occur, and they start to generate speech before that end of turn.

What are the primary inputs to that prediction? The most important one is semantics: the content of what the person is saying. Other inputs into this prediction algorithm in our heads, as we try to predict when the speaker is going to finish speaking, are syntax (the structure of the sentence), prosody (the expressiveness, the tone), and visual cues.

As with most things about how the human mind works, we don't really know for sure; it's complicated. But one of the generally accepted models, which I'll walk through now, breaks turn-taking in the human mind into three stages. The first stage is semantic prediction. What the listener is doing, and you'll notice this as you speak to other people at the conference if you pay attention to your own thought process, is constantly inferring the intended message of the person speaking to you. Before they finish speaking, you're figuring out what they're trying to say, and you're using that prediction of what they're trying to say to predict when the end of the utterance will occur. And you're not doing this just once; you do it again and again, constantly updating the prediction as the speaker keeps going. That's the first stage. In the next stage, once your prediction seems to be coming true and you're getting closer to when you think the end of the utterance will occur, you start refining that endpoint prediction based on both the semantics and the syntax. Then, as the speaker gets really close to the end of the turn, the listener finalizes the prediction using prosody: the tone and other acoustic features. So it's three steps: a semantic prediction, a refinement, and a finalization.
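To restate those three stages in code-shaped terms, here is a purely conceptual sketch; the model objects and thresholds are invented for illustration, and this is not a claim about how the brain, or any shipping system, actually computes it:

```python
# Conceptual sketch only: the three stages re-expressed as code. An
# illustration of the idea, not a model of the brain or of any real system.

def predict_end_of_turn(words_so_far, syntax_state, prosody_features,
                        semantic_model, syntax_model, prosody_model):
    """Return an estimated time (seconds from now) until the speaker finishes."""
    # Stage 1: infer the intended message and form a coarse endpoint estimate.
    intended_message = semantic_model.infer_message(words_so_far)
    estimate = semantic_model.estimate_endpoint(intended_message)

    # Stage 2: as the predicted endpoint approaches, refine it with syntax.
    if estimate < 1.0:
        estimate = syntax_model.refine(estimate, syntax_state)

    # Stage 3: very close to the end, finalize using tone and other acoustics.
    if estimate < 0.3:
        estimate = prosody_model.finalize(estimate, prosody_features)
    return estimate
```

In a real listener this whole estimate is recomputed continuously as each new word arrives, which is the "again and again" updating described above.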
One of the things I want to point out is that the human mind is full duplex: we're processing input and starting to generate output at the same time. That's nicely described in a figure from one of these papers. On the x-axis is time, where zero milliseconds is the end of the speaker's turn, and the different blocks represent the mental processes happening inside the mind of the listener. You can see that well before the end of the turn there's a whole comprehension track going on, where the listener is inferring the intended message of the speaker and making predictions about when they're going to finish speaking, and at the same time there's a production track, where the listener is starting to produce what they're going to say. We'll talk more about full duplex models in silico, rather than in human minds, in a little bit.

Oh, Jordan Dearsley just texted me "boo." How does he know to boo me? He's not even here. Okay, it's all good, we'll keep rolling.

So let's contrast this really interesting, complex process going on in the human mind with current voice AI systems. You'll see it's just so much simpler: it's just speech or not-speech, it's looking backwards, it's not making a prediction, and it runs in serial, with nothing happening in parallel. That's part of why these interruptions are happening.

I'm going to talk through three types of models and the approaches people are using with each. The prevailing way to build voice AI agents is the cascading system of models we talked about earlier: speech-to-text, VAD, LLM, TTS. The new approaches to better handling interruptions augment the VAD with models that look at the semantics, syntax, or prosody. I want to jump into an example. (I really have too much content for my allotted time, I'm watching the clock down here, but maybe I can just take some time from Jordan.)

Let me give an example of one of these semantic models used to augment the VAD: our model at LiveKit. It's a text-based semantic model. We take the last four turns of the conversation as input, that is, the voice AI agent's turn, the user's turn, the agent's turn, and then the user's current turn. Those are the inputs into a transformer model, and because this is an LLM, the token we're predicting is the end-of-utterance token. If, based on the context and the semantics of that input, the end-of-utterance prediction says the end of turn hasn't happened yet, we extend the silence-algorithm part of the VAD and say: don't trigger the end of turn, wait longer. The two work in concert. That's generally the idea of how it works.
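To make that concrete, here's a minimal sketch of how a semantic end-of-utterance score can gate the silence timeout; this is the shape of the logic, not LiveKit's actual plugin API, and the threshold and timeouts are made up for illustration:

```python
# Sketch of a semantic end-of-utterance model working in concert with the
# VAD's silence timeout. `eou_model` stands in for a small transformer that
# scores end-of-utterance probability from recent turns; this is not
# LiveKit's actual plugin API, and the numbers are illustrative.

BASE_TIMEOUT_S = 0.5      # normal silence timeout
EXTENDED_TIMEOUT_S = 3.0  # wait longer when the turn looks unfinished
EOU_THRESHOLD = 0.7       # assumed probability cutoff

def silence_timeout(conversation, eou_model):
    """Pick the silence timeout based on the last four turns of context."""
    last_four_turns = conversation[-4:]                # agent, user, agent, current user turn
    p_end_of_utterance = eou_model.predict(last_four_turns)
    if p_end_of_utterance >= EOU_THRESHOLD:
        return BASE_TIMEOUT_S                          # semantics say the user is likely done
    return EXTENDED_TIMEOUT_S                          # semantics say keep waiting
```

In the earlier pipeline sketch, this adaptive value would take the place of the fixed half-second timeout.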
Let me walk through a quick demo of how this works in action. In the first part of the demo, Shane talks to a voice agent that uses just the traditional VAD I was discussing earlier, which only looks at speech or not-speech. In the second half, he's using our semantic end-of-utterance model.

[Demo audio, first half, traditional VAD:]
Shane: "...I wasn't 100% sure what it was that I should..."
Agent: "Please..."
Shane: "I had to, I had to build a demo for, what was it, a LiveKit turn detection, a turn detection demo, where one of the agents, like..."
Agent: "Got it. What challenge..."
Shane: "It just, it just kept on interrupting me... and being interrupted by the agent constantly, that was the worst part for sure."
Agent: "I don't think it was the worst part for me. I understand. How did you overcome that challenge during the demo?"
Shane: "Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit's offering, so let's try that one now instead."

[Demo audio, second half, semantic end-of-utterance model:]
Shane: "Hey, can you interview me about a time when I had to build a demo?"
Agent: "Absolutely. Can you share your experience with building a demo?"
Shane: "Yeah, definitely. So I needed to build a demo, but I wasn't, I wasn't a hundred percent sure what it was that I should build. And then I, I was thinking, probably the best way to show that would be, uh, would be side by side."

Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that. You can really see it's a night-and-day difference when you augment the VAD with models that look at the semantics, syntax, and prosody, which is a good segue to my next slide.

There's another approach people are taking to augmenting the VAD: not just taking the semantic, text-based input, but also looking at the audio signal and trying to infer things from the acoustic features of the dialogue. The basic idea is that the input is audio tokens and the output is the probability that the user is finished speaking. Kwin and the Daily team have also built an open-weight model, Smart Turn, which is this neat combination of a transformer that also looks at acoustic characteristics. And one of the new things that just emerged: AssemblyAI dropped their new streaming speech-to-text service earlier this week, and their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker is finished speaking. So it's one model doing both of those things at the same time, looking at the acoustic features and the semantic features. Kyutai also has one that they recently released; it's pretty neat. One thing I want to note, though, is that if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context: it sees what the user is saying, but not what the agent is saying, so it doesn't quite have the full picture. But it works remarkably well. All of these approaches work remarkably well and are a major step forward from the more traditional VADs, and if you're building a voice AI agent after this conference, you should definitely go implement one; they're pretty easy to implement on the different platforms, too.
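Here's a minimal sketch of how an acoustic end-of-turn probability could be fused with the semantic and silence signals; the model objects, weights, and thresholds are all illustrative assumptions, not the Smart Turn or AssemblyAI interfaces:

```python
# Sketch of fusing an acoustic end-of-turn model with the VAD silence rule and
# a text-based semantic model. `acoustic_model` and `eou_model` are stand-ins;
# the weights and thresholds are made up for illustration.

def user_finished_speaking(trailing_silence_s, recent_audio, recent_turns,
                           acoustic_model, eou_model,
                           silence_timeout_s=0.5, threshold=0.6):
    """Combine silence, acoustics, and semantics into one end-of-turn decision."""
    if trailing_silence_s < 0.2:
        return False                                     # user spoke a moment ago
    p_acoustic = acoustic_model.predict(recent_audio)    # prosodic / acoustic cues
    p_semantic = eou_model.predict(recent_turns)         # text-based cues
    score = 0.5 * p_acoustic + 0.5 * p_semantic
    # Fall back to the plain silence rule if both models stay undecided.
    return score >= threshold or trailing_silence_s > silence_timeout_s
```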
Okay. We often say in voice AI that speech-to-speech models, audio in and audio out, are going to save us. But if you look at how these models work, like OpenAI's Realtime API, they're still using a VAD internally. They're still just looking at speech or not-speech, or you can opt to turn on what they call semantic VAD, which is kind of a paradox, not the best term, but it's a semantic model that augments the VAD.

So, to answer the title of my talk, why does ChatGPT Advanced Voice Mode keep interrupting you? Because it thinks you're done speaking, based on how long it's been since you last said a word, or based on what you've said previously, and that's just not quite cutting it when those interruptions happen. I want to bring up that this is an ongoing problem: none of the different approaches has perfected it yet, ours at LiveKit included. And although we power the transport, the audio layer, for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.

The next topic I want to cover (I'm running out of time here) is full duplex models. These are really neat. A full duplex model is more like a human mind in that it's processing input and generating speech at the same time. As far as I know, there aren't really any commercial applications of these yet, but fundamentally they're intuitive talkers, trained on raw audio data. The analogy I like to use is computer vision. In the early days of computer vision, we hand-wrote algorithms to try to recognize a stop sign based on its color, the number of sides, and so on, and it just didn't work very well. But when we started giving the raw image data to a neural network and letting the network figure it out, all of a sudden it started working, and I actually think what we learned from computer vision is what helped us emerge from the AI winter and seeded where we are now with AI. It's a similar story with full duplex models: we hand them the raw audio data and let them figure out how turn-taking works, rather than trying to hand-write all the rules. The downside of these models is that they're optimized for being really good at turn-taking but are kind of dumb as LLMs: they're small models, they're not trained on a lot of data, and they can't do instruction following very well.

To give you a more specific sense of how these models work, let's talk about the Moshi model. What really made it concrete for me is the idea that it is always listening to input and always generating output, and even when it's not its turn to speak, it's emitting "natural silence": silence you can't hear, but it's still always emitting. So it's always doing both, just like a human. SyncLLM, Meta AI's experimental full duplex mode that you can access inside their app, is also a full duplex model. Something neat about SyncLLM is that internally it's forecasting what the user is saying about five tokens, or 200 milliseconds, ahead, which is closer to what humans are doing, except we forecast over a much longer time frame.
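To make the "always listening, always emitting" idea concrete, here's a conceptual sketch of a full duplex loop; the model interface is invented for illustration and is not Moshi's or SyncLLM's actual API:

```python
# Conceptual sketch of a full duplex loop: the model consumes an input audio
# frame and emits an output audio frame on every step, and "not my turn" is
# just emitting (near-)silence. The interface here is hypothetical.

def full_duplex_loop(mic, speaker, model):
    state = model.initial_state()
    while True:
        incoming = mic.read_frame()                    # always listening
        outgoing, state = model.step(incoming, state)  # always generating
        speaker.play(outgoing)                         # audible speech or "natural silence"
```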
Lastly, my predictions for the future of how we'll solve this problem. I think full duplex models are neat, but I don't think they're going to solve it. For real production, commercial use cases of voice AI, we need more control, including control over how the agent says things like brand names. What I think will happen instead is that we'll get smarter and smarter VAD augmentations and faster and faster models in the cascaded pipeline, so we'll simply have more budget to do a good job at this. The reason I think that's true: computers don't do math the same way humans do, they don't have the same conceptual way of thinking about it, and LLMs think differently than us. Similarly, I wouldn't expect voice AI to use the same mechanisms as the human mind to generate speech and to talk.

Thank you all for your attention. This was fun; I really appreciate it.

Host: We do have some time, so I don't know if you want to take Q&A?
Tom: I would love to, we could do that.
Host: I could start with the first question. The demo you showed, there wasn't any response at the end, right?
Tom: I cut off the demo. It's actually a two-minute demo, and I only had 18 minutes to speak, so I truncated it on both sides.
Host: Okay, because I was thinking maybe you just turned everything off, and it was an impressive demo: no interruptions, no speaking. Do you have the end of the demo? Do you want to show it?
Tom: No worries, it's more of the same idea. What you can see is that Shane was taking his time talking, really pausing and thinking, and the agent wasn't interrupting him, and then it would eventually find his end of turn based on the context.
Host: Cool. Can we find it on your Twitter?
Tom: Yeah, it's on the LiveKit Twitter.
Host: Awesome, so we can look that up on the LiveKit Twitter. We can take some questions; I saw you had one.
Audience: Hi, how important are visual cues for turn detection in the human context, and is there any development to replicate that in the voice AI context as well?
Tom: That's a really neat question: how important are visual cues, and are people working on integrating them into turn-taking intelligence for avatars and real-time experiences? Despite the fact that we are visual animals, and visual is our most visceral input, visual cues are actually pretty low down the stack of predictors for when the end of turn will be, because it really is semantics. That's one of the main messages I want to convey from this talk: the content of what people are saying is the main thing we use to predict when they're going to finish speaking, and these other cues are ancillary to it. I'm sure somebody is working on building something really cool that's multimodal and uses visual cues to infer the end of turn; I just haven't seen it yet and can't keep up with all the AI stuff on the internet.
Audience: What is the average cost for a voice-generated call, and what is the effect when you try to keep regenerating the response?
Tom: The question is, what's the average cost for a voice AI call, and what is the cost when you keep regenerating the response? On regenerating the response, the thing that's most expensive in the pipeline tends to be the text-to-speech, so there are all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times within a turn, it's not all that costly. As for overall cost, there are some really nice calculators online; I'm personally not a voice AI agent builder and those unit economics don't directly affect me, so I don't have the numbers off the top of my head, but it's going to depend on how long the conversation is and that sort of thing.
Audience: Thanks for the demo, that was great. The question is about the new model you just showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now, if I were to build a voice bot on LiveKit? And three: during your development, one demo is great, but do you also do some kind of benchmarking with users to see if it's, say, 50% better?
Tom: The first question is why isn't OpenAI using our end-of-utterance model. I don't know why they're not; that's maybe above my pay grade at this company, I just joined four weeks ago. The second question is whether our end-of-utterance model is available for use. It's really easy to follow the quick start on our website and build a voice AI agent you can talk to, and it's just one more line in the pipeline you build: you turn it on and you get to use our end-of-utterance model. It's open weight, you don't have to pay for it, it's just baked in, and I think it's on by default in our docs. And remind me of the third question? Ah yes, benchmarking. We have benchmarks with our own test data set, and the numbers of course look great, but I was on a long call with our machine learning team this morning where we spent a lot of time talking about how to get a good data set for benchmarking this, and it's just a tough problem. I feel like the industry as a whole doesn't have a good benchmark around turn-taking, and that's something I'm sure will eventually emerge.
Host: Okay, we'll do one last question, I think. Yes, you in the back. Not great to be sitting in the back if you want to ask questions.
Audience: Thank you, Tom, for the demo and the presentation. I have a question related to back channels: how do you tackle the back-channel challenge in the turn-taking detection problem? First, for a natural conversational AI, the back channel cannot be ignored, and sometimes it causes trouble for the voice agent in detecting where the endpoint is. Second, a typical back channel like "yeah" or "yes" can occur as a back channel, or it can be the start of a turn that the agent should respond to in the following period. So how the back channel is handled seems pretty important in this field.
Tom: So the question is about not the case where the AI is interrupting the human, but the case where the human is accidentally interrupting the AI. I didn't really cover that in my talk; I was mostly focusing on the AI interrupting the human. Our approach is simple: we're just using Silero, the normal VAD approach, with a rule like "if the person has been speaking for more than X milliseconds, assume it's not a back channel and that they're actually trying to interrupt the voice AI." One of the things we want to build is another machine learning model that can recognize the difference between a back channel and someone genuinely trying to interrupt the voice AI. One quick note on the full duplex models: the Meta AI one can natively back channel, because it learned from the raw audio data, so when you're talking to it, it'll go "uh-huh," which is just so neat. But yeah, back channeling is a tough problem.
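To spell out that simple duration heuristic, here's a minimal sketch; the threshold is illustrative, not a value any particular system uses:

```python
# Sketch of the duration heuristic described above: short bursts of user
# speech while the agent is talking are treated as back channels ("yeah",
# "uh-huh"); sustained speech is treated as a real interruption. The
# threshold is illustrative, not a recommended value.

BACKCHANNEL_MAX_S = 0.7

def should_stop_agent(user_speech_duration_s, agent_is_speaking):
    """Decide whether overlapping user speech should interrupt the agent."""
    if not agent_is_speaking:
        return False                                      # nothing to interrupt
    return user_speech_duration_s > BACKCHANNEL_MAX_S     # longer than a back channel
```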
Host: Awesome. If you have more questions, you can find Tom afterwards. Please give another warm round of applause for Tom. Thank you all.