
Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit



00:00:00.000 | I'm going to jump right into it. I'll be talking about voice AI's interruption problem; then after
00:00:19.740 | that I'll talk about how we currently handle interruptions and turn-taking in voice AI, what we
00:00:25.280 | can learn from the study of human conversation about how humans
00:00:29.840 | handle turn-taking, and then about some of the really neat and interesting new
00:00:33.500 | approaches out there for handling turn-taking and interruptions in voice AI
00:00:38.300 | agents. Interruptions are the biggest problem in voice AI agents
00:00:44.540 | right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts
00:00:49.340 | you, it's annoying. But when a patient is talking to a voice AI
00:00:54.800 | dental assistant and it interrupts the patient, the patient hangs up and the
00:00:59.960 | dentist stops paying the voice AI developer. This is our collective problem,
00:01:03.680 | one that we all have to solve. And the problem is that
00:01:09.740 | turn-taking is just hard. Let me first define what turn-taking is: turn-taking is
00:01:15.080 | this unspoken system we have for who controls the floor between speakers
00:01:22.040 | during a conversation. Fundamentally it's hard because turn-taking happens
00:01:27.500 | really fast in human conversation and there's no one size that fits all. There's
00:01:32.060 | a really cool study that I pulled this data from, where they looked at how
00:01:37.580 | long it took a listener to start responding after the speaker finished
00:01:41.480 | speaking, across different cultures. We can see that the Danes take a relatively
00:01:46.700 | long time to start speaking after the other speaker finishes, but
00:01:53.780 | the Japanese do it almost instantaneously. So there are differences across cultures,
00:01:58.280 | and that's part of what makes it hard. There are also differences across individuals. I'm one
00:02:01.880 | of those people that takes a long time to respond. Even before I got into voice AI,
00:02:06.020 | people would sometimes ask, "are you going to respond?" and I'd say, "yeah, yeah,
00:02:09.440 | I'm thinking about it." And even though I'm one individual, there's a lot of
00:02:15.260 | variability in how quickly I respond: if you make me angry, I'm probably going to
00:02:19.460 | respond quicker. So it's just a hard problem. On the next slide,
00:02:25.760 | for people who are not very familiar with how voice AI agent
00:02:29.840 | pipelines work, I'm going to provide a simplified overview of how we currently
00:02:33.320 | handle turn-taking and interruptions in voice AI agents. The user
00:02:38.480 | starts speaking; that's the speech input. Those audio chunks are
00:02:42.260 | passed to a speech-to-text model, and that speech-to-text model transcribes the
00:02:48.320 | audio into a transcript. The next step is something called a VAD, which determines
00:02:55.000 | whether or not the user has finished speaking; I'll go more into that in a
00:02:59.060 | moment. If the user is finished speaking, the transcript is
00:03:03.440 | passed to an LLM, and the LLM outputs its chat completion. That chat completion is
00:03:07.880 | streamed out, and that stream is passed to a text-to-speech model, where it's
00:03:11.600 | converted into audio. And that audio, which is now the audio of the voice AI
00:03:15.440 | agent, is passed back to the user.
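
To make the cascade concrete, here is a minimal sketch of that loop in Python. It's purely illustrative: stt, vad, llm, and tts are hypothetical stand-ins for whatever speech-to-text, voice activity detection, LLM, and text-to-speech components a pipeline actually uses, not any specific vendor's API.

# Illustrative sketch of a cascaded voice AI pipeline (not any specific SDK).
# stt, vad, llm, and tts are assumed components with the interfaces shown.

def run_voice_agent(mic_stream, speaker, stt, vad, llm, tts):
    transcript_buffer = []                    # text of the user's current turn

    for audio_chunk in mic_stream:            # 1. user speech arrives as audio chunks
        text = stt.transcribe(audio_chunk)    # 2. speech-to-text transcribes the audio
        if text:
            transcript_buffer.append(text)

        # 3. the VAD decides whether the user has finished their turn
        if vad.user_finished_speaking(audio_chunk):
            user_turn = " ".join(transcript_buffer)
            transcript_buffer.clear()

            # 4. the finished turn goes to the LLM, which streams a chat completion
            for text_delta in llm.stream_completion(user_turn):
                # 5. the streamed text is converted to audio by text-to-speech
                agent_audio = tts.synthesize(text_delta)
                # 6. the agent's audio is played back to the user
                speaker.play(agent_audio)
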
00:03:23.000 | Let's dig more into the voice activity detection system. It's a system with primarily two parts. First, it's a machine
00:03:28.280 | learning model, a neural network, that is detecting whether or not somebody is
00:03:32.720 | speaking, so it's just speech or not speech. It's a pretty simple machine, and
00:03:38.540 | I shouldn't call it simple, it's a really neat model, but it's
00:03:41.780 | ultimately just looking at speech or not speech. The next part
00:03:45.720 | of it is a silence algorithm, and the silence algorithm is saying: okay, if the
00:03:50.960 | person hasn't spoken for more than half a second, they're done speaking and it's
00:03:55.400 | time for the agent to start speaking. In most production voice AI
00:04:01.400 | systems we're using something like that, and that's changing; we're
00:04:05.900 | building all sorts of new interesting things that I'll cover later in the
00:04:08.360 | presentation.
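
As a rough sketch of that two-part VAD, here is the speech/not-speech model plus the silence rule in Python. The frame size, thresholds, and the speech-probability interface are assumptions for illustration, not any particular VAD's API.

# Sketch of a conventional VAD-based endpointer: a speech/not-speech model plus
# a "more than half a second of silence means the turn is over" rule.
# Frame duration, thresholds, and the speech model interface are assumptions.

FRAME_MS = 30                  # duration of each audio frame
SPEECH_THRESHOLD = 0.5         # model probability above which a frame counts as speech
END_OF_TURN_SILENCE_MS = 500   # the "half a second" silence rule

class SilenceEndpointer:
    def __init__(self, speech_model):
        self.speech_model = speech_model   # neural net: audio frame -> P(speech)
        self.silence_ms = 0

    def process_frame(self, frame) -> bool:
        """Returns True when the user is considered done speaking."""
        is_speech = self.speech_model.speech_probability(frame) > SPEECH_THRESHOLD
        if is_speech:
            self.silence_ms = 0             # any speech resets the silence timer
        else:
            self.silence_ms += FRAME_MS     # otherwise accumulate silence
        return self.silence_ms >= END_OF_TURN_SILENCE_MS
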
00:04:13.320 | In the next part of my presentation, I want to dig into what we
00:04:17.420 | can learn from linguistics and academic research about how turn-taking
00:04:22.400 | works in human conversations. One of the lines I read in a paper, a line I
00:04:26.540 | really liked, was that turn-taking in human conversation is a psycholinguistic
00:04:31.040 | puzzle: we respond in about 200 milliseconds, but the process of
00:04:35.780 | finding the words, generating speech, and articulating speech takes about 600
00:04:41.060 | milliseconds. So how can we possibly be speaking so quickly? How can the listener
00:04:46.100 | start returning an answer so quickly when it takes much longer to
00:04:49.400 | actually generate the speech? The answer is that there has to be prediction
00:04:55.780 | going on: the listener is predicting when the end of turn is going to occur, and
00:04:59.300 | they start to generate speech before that end of turn. And what
00:05:03.920 | are the primary inputs in creating that prediction? The most important one is
00:05:07.280 | the semantics: the content of what the person is saying. Other inputs into this
00:05:11.760 | prediction algorithm in our head, when we're trying to predict when the
00:05:14.720 | speaker is going to finish speaking, are the syntax (the structure of the sentence), the
00:05:19.100 | prosody (the expressiveness, the tone), and also visual cues.
00:05:28.100 | Like most things in how the human mind works, we actually don't really know; it's
00:05:33.940 | complicated. But I would say one of the generally accepted models is what I'm
00:05:40.040 | going to walk through now of how turn-taking works in the human mind. It's
00:05:44.040 | broken up into three stages. The first stage is semantic prediction. What the
00:05:48.980 | listener is doing, and you'll notice this as you speak to other people at the
00:05:52.340 | conference if you pay attention to your thought process, is
00:05:56.000 | constantly inferring the intended message of the person that's speaking to
00:06:00.620 | you. So before they finish speaking, you're figuring out: wait, what are they
00:06:04.960 | trying to say? And then you're using your prediction of what they're
00:06:08.860 | trying to say to predict when the end of
00:06:16.240 | utterance will occur. And you're not doing this just once; you're doing it
00:06:19.060 | again and again. You're constantly updating this prediction
00:06:22.420 | as the speaker keeps going. That's the first stage. In the next stage, once it
00:06:29.080 | seems like your prediction is coming true and you have a general idea of when you
00:06:33.380 | think the end of utterance will occur, as you start getting closer to it you
00:06:36.800 | start refining that endpoint prediction based on both the semantics
00:06:42.500 | and the syntax. And then as the speaker starts
00:06:45.620 | getting really close to the end of turn, the listener finalizes the prediction by
00:06:50.120 | using prosody, using information about the tone and other acoustic
00:06:54.320 | features. So it's three steps: a semantic prediction, a refinement, and a
00:07:01.440 | finalization. One of the things I want to point out is that the human mind is
00:07:06.120 | full duplex: we're both processing input and starting to generate output at the
00:07:10.100 | same time. I think that's really nicely described in this figure from the
00:07:14.640 | paper. On the x-axis we have time, where zero milliseconds is the
00:07:21.060 | end of the speaker's turn, and the different blocks represent the
00:07:25.500 | mental processes that are happening inside the mind of the listener. You
00:07:29.520 | can see that well before the end of the turn there's this whole comprehension track
00:07:34.140 | going on, where the listener is inferring the intended message of the
00:07:40.680 | speaker and making predictions about when they're going to finish speaking. And at
00:07:44.440 | the same time there's also this production, or generation, track, where the
00:07:48.880 | listener is starting to produce what they're going to say. We're going to talk
00:07:56.300 | more about full duplex models in silico, rather than in human minds, in a
00:08:01.180 | little bit. Oh, Jordan Dearsley just texted me; he texted "boo". How does he
00:08:11.680 | know to boo me? He's not even here. Okay, it's all good, we'll keep rolling. Let's go
00:08:21.620 | back and contrast this really interesting, complex process that's going on in the human mind
00:08:28.940 | with current voice AI systems. You'll see it's just so much simpler:
00:08:33.380 | it's just speech or not speech, it's looking backwards, it's not making a prediction, and it's done in
00:08:39.200 | serial; nothing's happening in parallel. So it's much simpler, and that's part of the
00:08:46.220 | problem, part of why these interruptions are happening. I'm going to talk through three types of models and the
00:08:54.860 | approaches people are using in these three types of models. The prevailing model for building voice
00:09:00.860 | AI agents is the cascading system of models, where you have what we talked about earlier: speech-to-text,
00:09:07.120 | VAD, LLM, TTS. And what we're doing there, in the new approaches to better handling
00:09:15.420 | interruptions, is augmenting the VAD with models that look at the semantics, syntax, or prosody.
00:09:21.300 | I want to jump into an example of that. I really have too much content for my allotted time, I'm
00:09:33.920 | looking down here at the clock, but maybe I can just take some time from Jordan. Let me give an example of
00:09:42.420 | one of these semantic-type models that is used to augment the VAD: I'm going to talk about our model at
00:09:48.300 | LiveKit. It's a text-based semantic model. What we're doing is taking the last four turns of the
00:09:54.240 | conversation as input, meaning the voice AI agent's turn, then the user's turn, then the
00:10:01.800 | voice AI agent's turn, and then the user's current turn. Those are the inputs into a transformer model, and
00:10:09.000 | the token we're predicting, because this is an LLM, is the end-of-utterance token.
00:10:14.880 | Based on the content, the context and the
00:10:20.760 | semantics of that input, if the end-of-utterance prediction says the end of turn hasn't
00:10:28.200 | happened yet, then we extend the silence-algorithm part of the VAD and say: don't trigger the
00:10:35.760 | end of turn, wait longer. So they work in concert. That's generally the idea of how it works.
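
As a rough sketch of that "work in concert" idea (not LiveKit's actual implementation; the model interface, probability threshold, and timeout values here are assumptions for illustration): the transformer scores how likely it is that the user's turn is complete given the recent conversation, and a low score stretches the VAD's silence timeout.

# Sketch of augmenting the VAD's silence rule with a semantic end-of-utterance
# (EOU) model. The eou_model interface, threshold, and timeouts are illustrative
# assumptions, not LiveKit's actual API.

BASE_SILENCE_MS = 500    # normal end-of-turn silence threshold
MAX_SILENCE_MS = 3000    # how long we're willing to wait on a turn that looks unfinished

def end_of_turn(silence_ms, recent_turns, eou_model):
    """recent_turns: last four turns (agent, user, agent, current user turn)."""
    # The transformer predicts the probability that an end-of-utterance token
    # should follow the user's current text, given the conversational context.
    p_done = eou_model.predict_eou_probability(recent_turns)

    if p_done >= 0.8:
        # Semantically complete; keep the normal, snappy silence threshold.
        threshold = BASE_SILENCE_MS
    else:
        # Semantically incomplete (the user is likely mid-thought);
        # stretch the threshold so the VAD doesn't trigger the turn early.
        threshold = MAX_SILENCE_MS

    return silence_ms >= threshold
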
00:10:42.360 | I'll walk through a quick demo of how this works in action. In the first part of this demo, Shane is going to
00:10:54.240 | talk to a voice agent that is just using the traditional VAD I was discussing earlier, which just looks at
00:11:03.480 | speech or not speech, and in the second half of the demo he's using our semantic end-of-utterance model.
00:11:08.760 | ...100% sure what it was that I should... please... I had to build a demo... for... a LiveKit
00:11:23.420 | turn detection demo... a turn detection demo where one of the agents... got it, what challenge... it just kept
00:11:35.300 | on interrupting me... and being interrupted by the agent constantly, that was the worst part for sure... I
00:11:41.300 | don't think it was the worst part for me... I understand, how did you overcome that challenge during the demo?
00:11:48.300 | Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit is
00:11:56.180 | offering, so let's try that one now instead. Hey, can you interview me about a time when I had to build a
00:12:07.700 | demo? Absolutely, can you share your experience with building a demo? Yeah, definitely, so I needed to build a demo but I
00:12:20.880 | wasn't a hundred percent sure what it was that I should build, and then I was thinking probably the best way to show that would be
00:12:35.260 | side-by-side...
00:12:40.480 | Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that.
00:12:48.440 | You can really see it's a night and day difference when you augment the VAD with models that look at the
00:12:52.940 | semantics and syntax and prosody. So, a good segue to my next slide: there's another approach
00:13:01.120 | people are taking to augmenting the VAD, where they're not just taking the semantic, text-based input,
00:13:07.340 | but also looking at the audio signal and trying to infer things from the acoustic features of the
00:13:12.740 | dialogue.
00:13:14.300 | The basic idea is that the input is audio tokens and the output is the probability that the user is finished speaking.
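
A minimal sketch of that audio-in, probability-out shape, kept generic and hypothetical (this is not the Smart Turn, AssemblyAI, or Kyutai interface):

# Generic shape of an acoustic end-of-turn model: audio in, probability that the
# user has finished speaking out. All names here are hypothetical placeholders,
# not any particular vendor's API.

class AcousticTurnDetector:
    def __init__(self, model):
        self.model = model  # e.g. a transformer over audio tokens / acoustic features

    def end_of_turn_probability(self, audio_window) -> float:
        """Returns P(user finished speaking) for the trailing window of audio."""
        audio_tokens = self.model.tokenize(audio_window)
        return self.model.predict_finished(audio_tokens)

# A pipeline might declare the end of turn only when this probability and the
# silence-based VAD agree, for example:
#   if detector.end_of_turn_probability(window) > 0.7 and silence_ms > 300:
#       start_agent_response()
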
00:13:20.340 | Kwindla and the Daily team have also built an open-weight Smart Turn model, which is this neat combination of a model that is both a transformer and looking at acoustic
00:13:31.400 | characteristics. And then one of the new things that has just emerged: AssemblyAI dropped their new streaming speech-to-text service earlier this week,
00:13:43.560 | and their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker is finished speaking.
00:13:54.900 | So it's one model that's doing both of those things at the same time, and it's looking at both the acoustic features and the semantic features.
00:14:03.240 | Kyutai also has one that they recently released; it's pretty neat. One thing I want to note, though, is that if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context:
00:14:15.400 | it's only seeing what the user is saying, not what the agent is saying, so it doesn't quite have the full picture on the context.
00:14:21.760 | But all of these approaches work remarkably well and are a major step forward from the
00:14:29.600 | more traditional VADs, and they're definitely something that, if you're building a voice AI agent after
00:14:34.280 | this conference, you should go implement. They're pretty easy to implement on the different platforms too.
00:14:38.880 | Okay. We often say in voice AI that speech models are going to save us,
00:14:52.160 | that speech-to-speech models, audio in and audio out, are going to save us.
00:14:55.620 | But actually, if you look at how these models work, like OpenAI's Realtime API, they're still using a
00:15:02.980 | VAD on the internals. So they're still just looking at speech or not speech, or you can opt to turn on their semantic
00:15:11.180 | VAD, as they call it, which is kind of a paradox, it's not the best term, but that turns on a semantic model that augments the VAD.
00:15:18.580 | So, to answer the title of my talk, why ChatGPT Advanced Voice Mode keeps interrupting you:
00:15:24.020 | it's because it thinks you're done speaking based on how long it's been since you last said a word,
00:15:28.100 | or based on what you've said previously, and it's just not quite cutting it
00:15:34.420 | when those interruptions happen. It's a problem that is not totally solved; I want to
00:15:40.860 | bring that up too, that this is an ongoing problem with all the different approaches.
00:15:46.380 | Nothing has perfected it yet, ours at LiveKit included. And although we power the audio transport
00:15:53.180 | layer for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.
00:15:59.340 | The next topic I want to cover, and I'm running out of time here, is full duplex
00:16:07.180 | models. These are really neat. A full duplex model is more like a human mind in that it's processing
00:16:11.500 | input and generating speech at the same time. As far as I know there aren't really any commercial
00:16:17.420 | applications of these yet, but fundamentally they're intuitive talkers: they're
00:16:22.380 | trained on the raw audio data. The analogy I like to use is computer vision.
00:16:27.180 | In the early days of computer vision we were hand-writing algorithms to try to recognize a stop sign based
00:16:32.220 | on its color, the number of sides it has, and so on, and it just didn't work very well. But when we started
00:16:37.900 | giving the raw image data to the neural network and letting the neural network figure it out, all of a sudden
00:16:42.460 | it just started working. And I actually think that what we learned from computer vision
00:16:48.300 | really helped us emerge from the AI winter; it was a major seeding process for where we are
00:16:53.580 | now with AI. It's a similar story with full duplex models, in that we're handing them the raw audio
00:16:59.180 | data and just letting them figure out how turn-taking works rather than trying to hand-write all
00:17:03.180 | the rules. But the downside of these models is that they're really optimized for being really
00:17:08.620 | good at turn-taking, and they're kind of dumb LLMs: they're small models, they're not trained on a lot of
00:17:13.180 | data, and they can't do instruction following very well. Just to give you a more specific sense
00:17:18.540 | of how these models work, let's talk about the Moshi model. What really made it concrete for
00:17:23.340 | me is this idea that it's always listening to input and it's always generating
00:17:29.820 | output, and even when it's not its turn to speak it's emitting "natural silence". It's basically
00:17:34.700 | emitting silence that you can't hear, but it's still always emitting something, so it's always doing
00:17:39.740 | both, just like a human is. SyncLLM, which is Meta AI's full duplex experimental mode that you can
00:17:47.740 | access inside their app, is also a full duplex model. Something
00:17:54.540 | neat I want to bring up about SyncLLM is that, in the internals of that model,
00:17:59.580 | they're actually forecasting what the user is saying about five tokens ahead, or 200 milliseconds ahead, which is
00:18:05.260 | more closely like what humans are doing, except we're forecasting over a much longer time frame.
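
To make the full duplex idea concrete, here is a highly simplified sketch of that always-listening, always-emitting loop. It's a conceptual illustration only, not Moshi's or SyncLLM's actual architecture; the model methods and the five-token forecast horizon are stand-ins based on the description above.

# Conceptual sketch of a full duplex speech model's inner loop: on every step it
# consumes one frame of user audio AND emits one frame of its own audio, which is
# inaudible "natural silence" whenever it decides it isn't its turn to speak.
# This is an illustration, not Moshi's or SyncLLM's actual implementation.

def full_duplex_loop(model, mic_frames, speaker):
    state = model.initial_state()
    for user_frame in mic_frames:                 # always listening...
        state = model.update(state, user_frame)

        # ...and always generating: the next output frame may be audible speech
        # or a "natural silence" frame that the user simply doesn't hear.
        out_frame = model.next_output_frame(state)
        speaker.play(out_frame)

        # Some full duplex models (as described above for SyncLLM) also keep a
        # short forecast of the user's next few tokens to stay ahead of the turn.
        state = model.forecast_user(state, horizon_tokens=5)
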
00:18:11.900 | And then lastly, my predictions for the future of how we'll solve this problem. I think full duplex
00:18:18.540 | models are neat, but I don't think they're going to solve the problem. For real
00:18:23.820 | production, commercial use cases of voice AI we need more control, including more control over how it
00:18:29.020 | says things like brand names. Instead, what I think is going to happen is that we're going to get smarter
00:18:33.980 | and smarter VAD augmentations and faster and faster models in the cascade pipeline, and we're just going to
00:18:39.020 | have more budget to work with to do a good job with this sort of thing. The reason I think that's
00:18:45.340 | true is that computers don't do math the same way humans do; they don't have the same conceptual way of
00:18:49.980 | thinking about it, and LLMs think differently than us. Similarly, I wouldn't expect voice AI to use the
00:18:55.820 | same mechanisms as the human mind to generate speech and to talk. Thank you all for your attention;
00:19:02.860 | this was fun, I really appreciate it. We do have some time, so I don't know if you want to take
00:19:10.220 | Q&A? I would love to, we could do that, and I could start with the first question.
00:19:16.860 | So in the demo you showed, there wasn't any response at the end, right? I cut off the demo; it's
00:19:25.420 | actually a two-minute demo, and I only have 18 minutes to speak, so I truncated it on both sides.
00:19:29.340 | Because I was like, okay, maybe you just turned everything off and it was an impressive demo:
00:19:34.620 | no interruptions, no speaking. Yeah. Do you have the end of the demo, do you want to show it, or...?
00:19:40.700 | No, no worries, it's just more of the same idea. What you can see is that Shane was
00:19:46.620 | taking his time talking and really pausing and thinking, it wasn't interrupting him, and then
00:19:51.420 | it eventually would find his end of turn based on the context. Cool, can we find it on your Twitter or...?
00:19:57.340 | Yeah, it's on our LiveKit Twitter. Awesome, so we can look that up on the LiveKit Twitter.
00:20:02.860 | Awesome, yeah, we can take some questions. I saw you had one. Hi, how important are visual cues for turn
00:20:12.380 | detection in the human context, and is there any development to kind of replicate that in the
00:20:20.860 | voice AI context as well? Yeah, it's a really neat question: how important are visual cues, and
00:20:26.140 | are people working on integrating that into the turn-taking intelligence for avatars and real-time
00:20:35.100 | experiences? Visual cues, despite the fact that we are visual animals,
00:20:42.380 | and visual is the most visceral input for us, are actually pretty low down the stack of
00:20:49.500 | predictors for when the end of turn will be, because it really is semantics. That's one of
00:20:54.620 | the main messages I want to convey to people from this talk: the content of what people
00:20:59.020 | are saying is the main thing we're using to predict when they're going to finish speaking, and
00:21:03.740 | these visual cues and the other inputs are ancillary to it. I'm sure somebody's working on
00:21:09.500 | building something really cool that's multimodal and looking at visual cues to
00:21:13.980 | infer the end of turn; I just haven't seen it yet and can't keep up with all the AI stuff on the
00:21:19.260 | internet. Yes: what is the average cost for a typical voice AI call, and what is the
00:21:34.540 | effect when you try to keep regenerating the response?
00:21:37.580 | So the question is: what's the average cost for a voice AI call, and what is the cost when you keep
00:21:47.900 | trying to regenerate the response? I would first say, on the reference
00:21:56.940 | to what it costs to keep trying to regenerate the response, the way I want to answer that is
00:22:03.500 | that the thing that's most expensive in the pipeline tends to be the text-to-speech. There are
00:22:07.820 | all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times
00:22:11.820 | within a turn it's not all that costly, and those sorts of things. There are some really neat
00:22:17.980 | calculators online. Because I'm personally not a voice AI agent builder and those unit economics don't
00:22:23.900 | directly affect me, I don't have the numbers off the top of my head, but there are some really nice
00:22:28.700 | calculators; it's going to depend on how long the conversation is and that sort of thing.
00:22:31.900 | Yeah, thanks for the demo, that was great. The question is about your new model that you just
00:22:44.780 | showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now, if I were to build a voice bot on LiveKit? And three:
00:23:01.180 | during your development of that model (the demo is great), do you also do some kind of benchmarking with users to
00:23:09.340 | see if, you know, this is like 50% better or something like that? Yes, so the first question is about why
00:23:17.100 | OpenAI isn't using our end-of-utterance model. I don't know why they're not; I think that's maybe above
00:23:23.500 | my pay grade at this company, I just joined four weeks ago. The second question is whether
00:23:31.020 | our end-of-utterance model is available for use. It's
00:23:39.660 | really easy to follow our quickstart on our website and build a voice AI agent that you can talk to, and
00:23:45.340 | it's just one more line in the pipeline that you build: you turn it on with one more
00:23:51.020 | line in there and you get to use our end-of-utterance model. It's open weight, you don't have to pay
00:23:56.060 | for it, it's just baked in, and our docs show you how; I think it's in there by default. And
00:24:02.940 | the third question... remind me of the third question, I'm sorry.
00:24:06.460 | Did you do benchmarking? Ah yes, benchmarking. We have benchmarks where we have our test data set, and
00:24:14.620 | the numbers of course look great. But I was on a long call with our machine
00:24:22.540 | learning team this morning where we spent a lot of time just talking about how we get a good
00:24:26.700 | data set for benchmarking that, and it's just really a tough problem. I feel like
00:24:32.940 | the industry as a whole doesn't have a good benchmark around turn-taking, and that's
00:24:38.380 | something that I'm sure will eventually emerge. Okay, we'll do one last question, I think. There
00:24:45.900 | was... yes, you in the back. Not great to be sitting in the back if you want to ask questions, but...
00:24:51.740 | Thank you, Tom, for the demo and the presentation. I've got a question related to back channels:
00:24:59.260 | how do you tackle the back-channel challenge in the turn-taking detection problem? First,
00:25:05.020 | for a natural conversational AI, the back channel is something that cannot be ignored, and sometimes it
00:25:12.780 | causes trouble for the voice agent in detecting where the endpoint is. And second,
00:25:19.660 | a typical back channel like "yeah" or "yes" can occur as a back channel, or it can be the start of the
00:25:28.460 | agent's turn, where the agent should be responding in the following period. So how the back channel is handled
00:25:35.420 | should be kind of important in this field. So the question was about not
00:25:45.100 | the case where the AI is interrupting the human, but the case where the human is accidentally
00:25:50.860 | interrupting the AI. I didn't really cover that in my talk; I was mostly focusing on the AI
00:25:56.620 | interrupting the human. Our approach is simple: we're just using Silero,
00:26:04.940 | the normal VAD approach, where if the person has been speaking for more than X milliseconds we assume
00:26:11.260 | it's not a back channel and that they're actually trying to interrupt the voice AI. But one of the things
00:26:15.260 | we want to build is another machine learning model that can recognize the difference between
00:26:20.220 | a back channel and someone actually trying to interrupt the voice AI.
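
Roughly, that simple rule looks something like this (a sketch; the duration threshold is an assumption, not a number from the talk):

# Sketch of the simple heuristic described above: while the agent is speaking,
# short bursts of user speech ("yeah", "uh-huh") are treated as back channels,
# while sustained speech is treated as a real interruption.
# The threshold value is an illustrative assumption.

BACKCHANNEL_MAX_MS = 700   # user speech shorter than this is assumed to be a back channel

def should_treat_as_interruption(agent_is_speaking, user_speech_ms):
    if not agent_is_speaking:
        return False                              # nothing to interrupt
    return user_speech_ms > BACKCHANNEL_MAX_MS    # longer speech = real interruption
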
00:26:28.220 | One quick note on the full duplex models: the Meta AI one can natively back-channel, because it learned from the raw
00:26:34.460 | audio data, so when you're talking to it, it'll go "uh-huh", which is just so neat. Yeah, it's just a
00:26:42.220 | tough problem, the back-channeling thing. Awesome. Yeah, if you have more questions you
00:26:49.660 | can find Tom. Please give another warm round of applause for Tom. Thank you all.