Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit

I'm going to jump right into it. I'll be talking about voice AI's interruption problem, then about how we currently handle interruptions and turn-taking in voice AI, then about what we can learn from the study of human conversation on how humans handle turn-taking, and then about some of the really neat and interesting new approaches out there for handling turn-taking and interruptions in voice AI agents.

Interruptions are the biggest problem in voice AI agents right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts you, it's annoying. But when a patient is talking to a voice AI dental assistant and it interrupts the patient, the patient hangs up and the dentist stops paying the voice AI developer. This is our collective problem, one that we all have to solve. And the problem is that turn-taking is just hard.

Let me first define what turn-taking is. Turn-taking is the unspoken system we have for who controls the floor between speakers during a conversation. Fundamentally it's hard because turn-taking happens really fast in human conversation, and there's no one size that fits all. There's a really cool study I pulled this data from, where they looked at how long it took a listener to start responding after the speaker finished speaking, across different cultures. The Danes take a relatively long time to start speaking after the other speaker finishes, but the Japanese do it almost instantaneously. So there are differences across cultures, and that's part of what makes it hard. There are also differences across individuals. I'm one of those people who takes a long time to respond; even before I got into voice AI, people would sometimes comment, "Are you going to respond?" and I'd say, "Yeah, yeah, I'm thinking about it." And even though I'm one individual, there's a lot of variability in how quickly I respond: if you make me angry, I'm probably going to respond quicker. So it's just a hard problem.

Next, for people who aren't very familiar with how voice AI agent pipelines work, I'm going to provide a simplified overview of how we currently handle turn-taking and interruptions in voice AI agents.
The user starts speaking; that's the speech input. Those audio chunks are passed to a speech-to-text model, which transcribes the audio into a transcription. The next step is something called a VAD, which determines whether or not the user has finished speaking (I'll go more into that in a moment). If the user has finished speaking, the transcript is passed to an LLM, and the LLM outputs its chat completion. That chat completion is streamed out, and the stream is passed to a text-to-speech model, where it's converted into audio. That audio, which is now the voice of the AI agent, is passed back to the user.
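To make that flow concrete, here's a minimal sketch of a cascading pipeline loop. It's illustrative only: the `stt`, `vad`, `llm`, and `tts` objects are hypothetical stand-ins for whatever providers you plug in, not any particular vendor's API.

```python
# Minimal sketch of a cascading voice AI pipeline (illustrative, not any
# specific vendor's API). `stt`, `vad`, `llm`, and `tts` are hypothetical
# stand-ins for real provider clients.

def run_turn(mic_frames, stt, vad, llm, tts, speaker):
    transcript_parts = []

    for frame in mic_frames:                 # raw audio chunks from the user
        text = stt.transcribe(frame)         # streaming speech-to-text
        if text:
            transcript_parts.append(text)

        if vad.user_finished(frame):         # VAD decides the user is done
            break

    user_turn = " ".join(transcript_parts)

    # Once the turn is over, ask the LLM for a reply and stream it to TTS.
    for token in llm.stream_completion(user_turn):
        audio = tts.synthesize(token)        # text-to-speech, chunk by chunk
        speaker.play(audio)                  # agent's voice back to the user
```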
Let's dig more into the voice activity detection (VAD) system. It's a system with primarily two parts. The first is a machine learning model, a neural network, that detects whether or not somebody is speaking: speech or not speech. I shouldn't call it simple, it's a really neat model, but ultimately it's just looking at speech versus not speech. The second part is a silence algorithm, and the silence algorithm says: if the person hasn't spoken for more than half a second, they're done speaking and it's time for the agent to start speaking. In most production voice AI systems we're using something like that. That's changing, though; we're building all sorts of new, interesting things that I'll cover later in the presentation.
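Here's a minimal sketch of that two-part VAD, assuming a frame-level speech classifier with a `speech_probability` method (a hypothetical name) and a half-second silence threshold.

```python
# Sketch of the classic VAD endpointing rule: a frame-level speech/not-speech
# model plus a silence timer. `speech_model.speech_probability` is a
# hypothetical stand-in for any frame-level VAD (e.g. a small neural network).

FRAME_MS = 20           # duration of each audio frame
SILENCE_LIMIT_MS = 500  # "done speaking" after half a second of silence
SPEECH_THRESHOLD = 0.5  # probability above which a frame counts as speech

def user_finished_speaking(frames, speech_model) -> bool:
    silence_ms = 0
    heard_speech = False

    for frame in frames:
        if speech_model.speech_probability(frame) > SPEECH_THRESHOLD:
            heard_speech = True
            silence_ms = 0               # reset the timer on any speech
        else:
            silence_ms += FRAME_MS

        # End of turn only after the user has spoken at least once and then
        # stayed silent for longer than the threshold.
        if heard_speech and silence_ms >= SILENCE_LIMIT_MS:
            return True

    return False
```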
In the next part of my presentation I want to dig into what we can learn from linguistics and academic research about how turn-taking works in human conversations. One of the lines I really liked from a paper I read was that turn-taking in human conversation is a psycholinguistic puzzle: we respond in about 200 milliseconds, but the process of finding the words, generating speech, and articulating it takes about 600 milliseconds. So how can the listener possibly start returning an answer so quickly when it takes much longer to actually generate the speech? If production takes roughly 600 milliseconds and the response gap is only about 200, the listener must start preparing their response several hundred milliseconds before the speaker finishes. The answer is that there has to be prediction going on: the listener is predicting when the end of turn will occur, and they start to generate speech before that end of turn. What are the primary inputs to that prediction? The most important one is semantics, the content of what the person is saying. Other inputs into this prediction algorithm in our head, when we're trying to predict when the speaker will finish speaking, are the syntax (the structure of the sentence), the prosody (the expressiveness, the tone), and also visual cues.
Like most things about how the human mind works, we don't really know exactly; it's complicated. But I'd say one of the generally accepted models is what I'm going to walk through now: how turn-taking works in the human mind. It's broken up into three stages. The first stage is semantic prediction. What the listener is doing, and you'll notice this as you speak to other people at the conference if you pay attention to your own thought process, is constantly inferring the intended message of the person speaking to you. Before they finish speaking, you're figuring out, wait, what are they trying to say? And then you're using that prediction of what they're trying to say to predict when the end of utterance will occur. You're not doing this just once; you're doing it again and again, constantly updating the prediction as the speaker keeps going. That's the first stage. In the next stage, once it seems like your prediction is coming true and you have a general idea of when the end of utterance will occur, you start refining that endpoint prediction based on both the semantics and the syntax. Then, as the speaker gets really close to the end of turn, the listener finalizes the prediction using prosody, the information around the tone and other acoustic features. So it's three steps: semantic prediction, refinement, and finalization.
One of the things I want to point out is that the human mind is full duplex: we're processing input and starting to generate output at the same time. I think that's really nicely described in a figure from this paper. On the x-axis is time, where zero milliseconds is the end of the speaker's turn, and the different blocks represent the mental processes happening inside the mind of the listener. You can see that well before the end of the turn there's a whole comprehension track going on, where the listener is inferring the intended message of the speaker and making predictions about when they're going to finish speaking. At the same time there's also a production, or generation, track where the listener is starting to produce what they're going to say. We'll talk more about full duplex models in silico, rather than in human minds, a little later in my presentation.

Oh, Jordan Dearsley just texted me. He texted "boo." How does he know to boo me? He's not even here. Okay, it's all good, we'll keep rolling.
So let's go back and contrast this really interesting, complex process that's going on in the human mind with current voice AI systems. You'll see it's just so much simpler: it's just speech or not speech, it's looking backwards rather than making a prediction, and it's done in serial; nothing happens in parallel. So it's much simpler, and that's part of why these interruptions are happening.

I'm going to talk through three types of models and the approaches people are using with each. The prevailing model for building voice AI agents is the cascading system of models, where you have what we talked about earlier: speech-to-text, VAD, LLM, TTS. The new approach to better handling interruptions there is that we're augmenting the VAD with models that look at the semantics, syntax, or prosody.
I want to jump into an example. I really have too much content for my allotted time, I'm looking at the clock down here, but maybe I can just take some time from Jordan. Let me give an example of one of these semantic-type models used to augment the VAD: our model at LiveKit. It's a text-based semantic model. We take the last four turns of the conversation as input, meaning the voice AI agent's turn, then the user's turn, then the voice AI agent's turn, and then the user's current turn. Those are the inputs into a transformer model, and because this is an LLM, the token we're predicting is the end-of-utterance token. If, based on the content, the context, and the semantics of that input, the end-of-utterance prediction says the end of turn hasn't happened yet, then we extend the silence-algorithm part of the VAD and say: don't trigger the end of turn, wait longer. So the two work in concert. That's generally the idea of how it works.
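Here's a minimal sketch of that "work in concert" idea, assuming a hypothetical `eou_model` that scores the probability that the user's turn is complete from the last four turns of chat context. The function names and thresholds are illustrative, not LiveKit's actual implementation.

```python
# Sketch of augmenting the VAD's silence rule with a semantic end-of-utterance
# (EOU) model. `eou_model.end_of_utterance_probability` is a hypothetical
# stand-in for a transformer that reads the recent chat context.

BASE_SILENCE_MS = 500       # normal silence threshold
EXTENDED_SILENCE_MS = 3000  # wait much longer when the turn looks unfinished
EOU_THRESHOLD = 0.7         # probability above which the turn looks complete

def silence_threshold_ms(chat_history, eou_model) -> int:
    """Pick how long the VAD should wait, given the last four turns."""
    context = chat_history[-4:]  # agent, user, agent, current user turn
    p_done = eou_model.end_of_utterance_probability(context)

    if p_done >= EOU_THRESHOLD:
        # The semantics say the user sounds finished: keep the short timeout.
        return BASE_SILENCE_MS
    # The semantics say the user is probably mid-thought ("I was thinking..."),
    # so extend the silence window instead of triggering the end of turn.
    return EXTENDED_SILENCE_MS
```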
Now I'll walk through a quick demo of how this works in action. In the first part of the demo, Shane talks to a voice agent that is just using the traditional VAD I was discussing earlier, the one that only looks at speech or not speech. In the second half of the demo he's using our semantic end-of-utterance model.

[Demo, first half, traditional VAD; the agent repeatedly jumps in whenever Shane pauses:] "...wasn't 100% sure what it was that I should..." "I had to, I had to build a demo for... a LiveKit, a LiveKit turn detection, a turn detection demo, where one of the agents..." "Got it. What challenge..." "It just, it just kept on interrupting me." "...being interrupted by the agent constantly, that was the worst part for sure." "I understand. How did you overcome that challenge during the demo?" "Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit's offering, so let's try that one now instead."

[Demo, second half, semantic end-of-utterance model:] "Hey, can you interview me about a time when I had to build a demo?" "Absolutely. Can you share your experience with building a demo?" "Yeah, definitely. So I needed to build a demo, but I wasn't, I wasn't 100% sure what it was that I should build, and then I, I was thinking, like, probably the best way to show that would be..."

[Applause] Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that.
You can really see it's a night-and-day difference when you augment the VAD with models that look at the semantics, syntax, and prosody. That's a good segue to my next slide. There's another approach people are taking to augmenting the VAD, where they don't just take the semantic, text-based input; they also look at the audio signal and try to infer things from its acoustic features. The basic idea is that the input is audio tokens and the output is the probability that the user has finished speaking. Kwin and the Daily team have built their open-weight Smart Turn model, which is this neat combination of a model that is both a transformer and looking at acoustic characteristics. And one of the new things that just emerged: AssemblyAI dropped their new streaming speech-to-text service earlier this week, and their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker has finished speaking. So it's one model doing both of those things at the same time, and it's looking at both the acoustic features and the semantic features. Kyutai also has one that they recently released; it's pretty neat. One thing I want to note, though: if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context. It's only seeing what the user is saying, not what the agent is saying, so it doesn't quite have the full picture of the context. But it works remarkably well. All of these approaches work remarkably well and are a major step forward from the more traditional VADs, and if you're building a voice AI agent after this conference, you should definitely go implement one of them; they're pretty easy to implement on the different platforms, too.
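The consumer-side pattern is roughly the same regardless of vendor. Here's a minimal, vendor-neutral sketch; the `stt_events` stream and its field names are hypothetical, not any specific provider's API.

```python
# Vendor-neutral sketch of consuming a streaming STT that emits, along with
# each transcript update, a probability that the speaker is finished.
# The event stream and field names are hypothetical.

EOT_THRESHOLD = 0.8  # how confident the model must be that the turn is over

def collect_user_turn(stt_events) -> str:
    parts = []
    for event in stt_events:
        if event.text:
            parts.append(event.text)
        # Finish the turn as soon as the model says the speaker is done,
        # instead of waiting out a fixed silence window.
        if event.end_of_turn_probability >= EOT_THRESHOLD:
            break
    return " ".join(parts)
```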
Okay. We often say in voice AI that speech models are going to save us, that speech-to-speech models, audio in and audio out, are going to save us. But actually, if you look at how these models work, like OpenAI's Realtime API, they're still using a VAD internally. They're still just looking at speech or not speech, or you can opt to turn on what they call semantic VAD, which is kind of a paradox, not the best term, but it turns on a semantic model that augments the VAD.
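For example, in OpenAI's Realtime API the turn-detection behavior is chosen in the session configuration. The sketch below is from memory of the public docs, so treat the exact field names and values as assumptions to verify against the current API reference.

```python
# Sketch of switching the Realtime API from plain server-side VAD to its
# semantic VAD. Field names are from memory of the public docs; verify them
# against the current API reference before relying on this.

server_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",        # classic speech/not-speech + silence
            "silence_duration_ms": 500,  # end the turn after 500 ms of silence
        }
    },
}

semantic_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",      # augments the VAD with a text model
        }
    },
}

# Either payload would be sent over the Realtime API's websocket connection,
# e.g. ws.send(json.dumps(semantic_vad_config)).
```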
So, to answer the title of my talk, why ChatGPT Advanced Voice Mode keeps interrupting you: it's because it thinks you're done speaking, based either on how long it's been since you last said a word or on what you've said previously, and that's just not quite cutting it when those interruptions happen. And it's a problem that isn't totally solved; I want to bring that up too. This is an ongoing problem across all the different approaches, and nothing has perfected it yet. Ours at LiveKit hasn't either, and although we power the transport, the audio transport layer, for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.
The next topic I want to cover, and I'm running out of time here, is full duplex models. These are really neat. A full duplex model is more like a human mind in that it's processing input and generating speech at the same time. As far as I know there aren't really any commercial applications of these yet, but they're fundamentally intuitive talkers: they're trained on raw audio data. The analogy I like to use is computer vision. In the early days of computer vision we were hand-writing algorithms to try to recognize a stop sign based on its color, the number of sides it has, and so on, and it just didn't work very well. But when we started giving the raw image data to a neural network and letting the neural network figure it out, all of a sudden it just started working. I actually think what we learned from computer vision is what really helped us emerge from the AI winter; it was a major seed for where we are now with AI. It's a similar story with full duplex models: we hand them the raw audio data and let them figure out how turn-taking works, rather than trying to hand-write all the rules. The downside of these models is that they're really optimized for being good at turn-taking, and they're kind of dumb LLMs: they're small models, they're not trained on a lot of data, and they can't do instruction following very well.
To give you a more specific sense of how these models work, let's talk about the Moshi model. What really made it concrete for me is the idea that it's always listening to input and always generating output, and even when it's not its turn to speak, it's emitting natural silence. It's basically emitting silence that you can't hear, but it's still always emitting something, so it's always doing both, just like a human.
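Conceptually, you can picture the full duplex loop like this. It's a deliberately simplified sketch, not Moshi's actual architecture or API; `model.step` and the frame handling are hypothetical.

```python
# Conceptual sketch of a full duplex loop: every frame, the model consumes the
# user's audio and emits something, either audible speech or "natural
# silence". There is no separate "listening" vs "speaking" state.
# `model.step` and the mic/speaker objects are hypothetical.

def full_duplex_loop(mic, speaker, model):
    while True:
        user_frame = mic.read_frame()        # always consuming input...
        agent_frame = model.step(user_frame) # ...and always producing output

        # When it isn't the model's turn, agent_frame is a silence frame, so
        # playback is inaudible; when it decides to speak (or back-channel),
        # audible audio simply starts appearing in the same stream.
        speaker.play(agent_frame)
```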
SyncLLM, which is Meta AI's full duplex experimental mode that you can access inside their app, is also a full duplex model. Something neat I want to bring up about SyncLLM is that internally the model is actually forecasting what the user is saying about five tokens, or about 200 milliseconds, ahead, which is closer to what humans are doing, except we forecast over a much longer time frame.
Lastly, my predictions for the future of how we'll solve this problem. I think full duplex models are neat, but I don't think they're going to solve the problem. For real production, commercial use cases of voice AI, we need more control, including control over how the agent says things like brand names. Instead, what I think is going to happen is that we'll get smarter and smarter VAD augmentations and faster and faster models in the cascade pipeline, and we'll just have more budget to work with to do a good job with this sort of thing. The reason I think that's true is that computers don't do math the same way humans do, they don't have the same conceptual way of thinking about it, and LLMs think differently than we do. Similarly, I wouldn't expect voice AI to use the same mechanisms as the human mind to generate speech and to talk.

Thank you all for your attention; this was fun, I really appreciate it.

We do have some time, so I don't know if you want to take Q&A?

I would love to.

We could do that, and I could start with the first question.
So in the demo you showed, there wasn't any response at the end, right?

I cut off the demo. It's actually a two-minute demo, and I only have 18 minutes to speak, so I truncated it on both sides.

Because I was thinking, okay, maybe you just turned everything off, and it was an impressive demo: no interruptions, no speaking. Do you have the end of the demo, do you want to show it?

No, no worries. It's just more of the same idea. What you can see is that Shane was taking his time talking, really pausing and thinking, and it wasn't interrupting him; it would eventually find his end of turn based on the context.

Cool. Can we find it on your Twitter?

Yeah, it's on our LiveKit Twitter.

Awesome, so we can look that up on the LiveKit Twitter.
Awesome, yeah, we can take some questions. I saw you had one.

Hi. How important are visual cues for turn detection in the human context, and is there any development to replicate that in the voice AI context as well?

Yeah, it's a really neat question: how important are visual cues, and are people working on integrating them into the turn-taking intelligence for avatars and real-time experiences? Despite the fact that we're visual animals, and visual is the most visceral input for us, visual cues are actually pretty low down the stack of predictors for when the end of turn will be, because it really is semantics. That's one of the main messages I want to convey with this talk: the content of what people are saying is the main thing we use to predict when they're going to finish speaking, and the visual cues and the other inputs are ancillary to it. I'm sure somebody is working on building something really cool that's multimodal and looks at visual cues to infer the end of turn; I just haven't seen it yet, and I can't keep up with all the AI stuff on the internet.
Yes: what is the average cost for a typical voice-generated call, and what is the effect on cost when you have to keep regenerating the response?

So the question is: what's the average cost for a voice AI call, and what does it cost when you keep regenerating the response? The way I want to answer the part about regenerating the response is that the most expensive thing in the pipeline tends to be the text-to-speech. There are all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times within a turn, it's not all that costly. There are some really neat calculators online; I'm personally not a voice AI agent builder and those unit economics don't directly affect me, so I don't have the numbers off the top of my head, but there are some really nice calculators, and it's going to depend on how long the conversation is and that sort of thing.
Yeah, thanks for the demo, that was great. The question is about the new model you just showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now, if I were to build a voice bot on LiveKit? And three: during your development of that model, the demo is great, but do you also do some kind of benchmarking with users to see whether it's, say, 50% better or something like that?

Yes. The first question is about why OpenAI isn't using our end-of-utterance model. I don't know why they're not; I think that's maybe above my pay grade, I just joined this company four weeks ago. The second question is whether our end-of-utterance model is available for use. It's really easy to follow the quickstart on our website and build a voice AI agent that you can talk to, and it's just one more line in the pipeline that you build: you turn it on with one more line and you get to use our end-of-utterance model. It's open weight, you don't have to pay for it, it's just baked in, and our docs show you how; I think it's in there by default.
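For context, that "one more line" looks roughly like this in a LiveKit Agents pipeline. This is a sketch from memory of the quickstart, so treat the exact module paths and class names as assumptions and check the current LiveKit docs.

```python
# Rough sketch of enabling LiveKit's end-of-utterance model in an Agents
# pipeline. Module paths and class names are from memory and may differ from
# the current SDK; the point is that turn detection is one extra argument.

from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, cartesia, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=cartesia.TTS(),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),  # the "one more line"
)
```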
And the third question... remind me of the third question, I'm sorry?

Did you do benchmarking?

Ah yes, benchmarking. So we have benchmarks where we have our test data set, and of course the numbers look great. But I was on a long call with our machine learning team this morning where we spent a lot of time just talking about how to get a good data set for benchmarking this, and it's a really tough problem. I feel like the industry as a whole doesn't have a good benchmark around turn-taking, and it's something I'm sure will eventually emerge.

Okay, we'll do one last question, I think. There was... yes, you in the back. Not great to be sitting in the back if you want to ask questions, but go ahead.
Thank you, Tom, for the demo and the presentation. I've got a question related to backchannels: how did you tackle the backchannel challenge in the turn-taking detection problem? First, for a natural conversational AI, the backchannel is something that can't be ignored, and sometimes it causes trouble for the voice agent in detecting where the endpoint is. And second, a typical backchannel like "yeah" or "yes" can occur as a backchannel, or it can be the start of a turn that the agent should then respond to in the following period. So how the backchannel is handled seems pretty important in this field.

So the question was about not the case where the AI is interrupting the human, but the case where the human is, perhaps accidentally, interrupting the AI. I didn't really cover that in my talk; I was mostly focusing on the AI interrupting the human. Our approach is simple: we're just using Silero, the normal VAD approach, and saying that if the person has been speaking for more than X milliseconds, assume it's not a backchannel and that they're actually trying to interrupt the voice AI. But one of the things we want to build is another machine learning model that can recognize the difference between a backchannel and someone actually trying to interrupt the voice AI.
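A minimal sketch of that duration-threshold heuristic; the threshold value is illustrative, not LiveKit's actual setting.

```python
# Sketch of the simple heuristic for telling a backchannel ("uh-huh", "yeah")
# from a real interruption while the agent is speaking: short bursts of user
# speech are ignored, sustained speech cancels the agent's turn.
# The threshold is illustrative, not a production-tuned value.

INTERRUPT_MIN_MS = 600  # user speech shorter than this is treated as a backchannel

def should_stop_agent(user_speech_duration_ms: int, agent_is_speaking: bool) -> bool:
    if not agent_is_speaking:
        return False
    return user_speech_duration_ms >= INTERRUPT_MIN_MS
```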
One quick note on the full duplex models: the Meta AI one can natively backchannel, because it learned from the raw audio data, so when you're talking to it, it'll go "uh-huh," which is just so neat. But yeah, the backchanneling thing is a tough problem.

Awesome. If you have more questions, you can find Tom afterwards. Please give another warm round of applause for Tom. Thank you all.