Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram

00:00:00.000 |
Hey everybody, if you don't know about Deepgram, Deepgram is a company that does audio AI, so we've 00:00:18.880 |
been around for nine years now, so ancient in AI years, but we brought end-to-end deep learning 00:00:28.620 |
to speech recognition, and our most recent product is a TTS product that we released back in December, 00:00:37.180 |
and so you can use our API on-prem, you can use it in the cloud, you can adapt models to a specific 00:00:45.660 |
acoustic environment. I just came from a drive-through that has all sorts of crazy things happening in it 00:00:50.840 |
like cars running and Amazon trucks backing up and all of that, and we automatically adapt models 00:00:57.240 |
for that kind of thing, etc. So anyway, if you don't know about us, that's what we do. Also, we're a research 00:01:03.800 |
led company so what that means is we figure out fundamental things through our research team and 00:01:09.880 |
then we bring them to the market as soon as possible through our product team. So yeah, everything 00:01:17.480 |
is moving at breakneck speed, but a lot of the things that folks enjoy now, with like really fast, accurate 00:01:23.880 |
real-time speech recognition, if you have any devices out there (I was just talking with the 00:01:30.440 |
founder a couple minutes ago), most of them use Deepgram behind the scenes. So that's who we are, 00:01:37.560 |
and we help companies build truly conversational experiences. And the bar that we set for ourselves, 00:01:44.440 |
and we set it from the beginning, but it takes a long time to get there, is what can a human do, and we ask 00:01:51.000 |
this all the time in our product meetings, etc., and our research team, and that helps guide us toward what's 00:01:58.040 |
possible, because neural networks really do work in a way that is similar to a human: they can learn by 00:02:04.040 |
example, etc. And so you have to think about how can we formulate this problem in a way that this 00:02:10.600 |
machine can actually learn from it. And so we spend all of our time thinking that way, and so 00:02:15.640 |
arguments about like how many neurons does it have, is it equivalent to a mouse or a cat or a person or 00:02:20.120 |
whatever, we're talking about that kind of thing all the time. And it's amazing to see the progress that 00:02:25.240 |
has happened over the last decade, and we're at an amazing time right now. So one thing to keep in 00:02:31.640 |
mind is that there was like a previous version of voice AI, and it was kind of, you know, voice AI 1.0, 00:02:39.720 |
think Siri, think things like that, where it was slow, it wasn't super accurate, it didn't really have a 00:02:45.000 |
real-time feel to it, and you could try to ask it anything, but it would only answer in 00:02:52.440 |
a very specific domain. So now with the next version of voice AI, it's open-ended, and 00:02:59.400 |
a lot of what's driving that is LLMs. You can put any text into it and, depending on how the model is 00:03:05.240 |
trained, you can output any text. And so if you're having a voice AI conversation, you can take the audio, 00:03:11.400 |
turn it into text using speech to text software, inject it into an LLM, get text out the other 00:03:16.440 |
side, and then use a text to speech model in order to produce audio. And part of this new voice AI 00:03:23.960 |
2.0 is how fast can you do that, how smart of a way can you do that so that the agent actually 00:03:30.120 |
does what it's supposed to, and then how expressive does it sound. And so we'll talk about that a little 00:03:34.760 |
bit, but, you know, this is the type of process: speech to text, to an LLM (I would just 00:03:41.160 |
internally call that text to text; who cares if it's a transformer model or whatever 00:03:46.200 |
it is), but you transform speech to text and then text to text, then text to speech, and you keep that loop 00:03:51.240 |
going. And I would say, what we're seeing right now, and, you know, 00:03:59.320 |
the folks out there might say there's a big hype cycle, we're at the peak of it or something. I would say, 00:04:04.040 |
hang on a second, this is more like 1910, you see the first cars in the street or something 00:04:10.040 |
like that. Really it's just a tiny thing that's happening right now, and there's going to be a 00:04:14.680 |
massive 100x explosion in the next like decade for voice AI, text, etc. AI is going to be 00:04:21.640 |
everywhere. The way I would think about this, and, you know, encourage others to think about it this way, is 00:04:26.520 |
there was like an agricultural revolution that took like a thousand, two thousand years to happen, 00:04:31.160 |
then there's an industrial revolution that took like maybe 250, 300 years to happen, then there's an 00:04:36.120 |
information revolution that took like 75 years to happen. You see the trend: the intelligence revolution 00:04:41.800 |
is going to take like 25, maybe 30 years to happen. And so if you thought tech companies were fast before, 00:04:47.000 |
AI companies have to move three times faster. So anyway, this is how it works today: there's 00:04:53.720 |
a speech to text system you can see in there, there's an LLM or a text to text system, and then 00:05:01.960 |
a text to speech system, and each of these systems works fairly independently, but the state of the art 00:05:07.960 |
works very quickly, and they have high efficacy or accuracy. So speech to text, even just five or, 00:05:16.360 |
you know, eight years ago, it used to be maybe 75% accuracy or somewhere around there. Now it's 00:05:22.360 |
like over 90%, and that over 90% really makes a big difference. Also, speech to text used to be maybe a 00:05:28.520 |
two to five second delay for real time; now it's like 100 milliseconds or maybe 200 milliseconds 00:05:35.240 |
with the high accuracy, and also you can run it on-prem, and you can co-locate all these services together. 00:05:41.800 |
Actually, the founder of Daily, I just saw him roaming around, they just came out with a blog 00:05:46.040 |
post showing that you can do the entire voice AI round-trip conversation in less than 500 milliseconds 00:05:51.800 |
using Deepgram for the speech to text, Llama for the text to text, and then Deepgram for the text to speech. 00:05:57.640 |
But nevertheless, that's how a human responds: it's between like 400 to 600 milliseconds in turn 00:06:03.400 |
taking. So you can do all that here, but there is a piece that's missing, which is, 00:06:11.400 |
these are all, like I just said, speech to text models, text to text models, text to speech; they're not 00:06:16.280 |
passing along any context throughout the conversation. So what this ends up with is a few 00:06:23.720 |
spots where you're like, hang on a second, it's not really getting exactly what I'm saying. And maybe 00:06:30.520 |
that only happens in 10% of the interactions or 20% of the interactions, or what I really mean is like 10 or 20% of 00:06:37.080 |
the turns. But the way that you combat that is by adding in context, and so that's what 00:06:45.880 |
this view is right here. It looks like a really subtle change, but instead of there being speech to 00:06:51.560 |
text models where it just takes in audio and puts out text, it will take in audio and context. So 00:06:57.400 |
think promptable, but not necessarily text promptable. It could be promptable with anything: promptable with 00:07:02.840 |
other audio, it could be promptable with images, it could be promptable with documents, it could be 00:07:08.040 |
promptable with the previous turn of the conversation. But what that gives you is it gives that speech to 00:07:13.400 |
text model context. When you send something to a speech to text model right now, 00:07:19.160 |
it's kind of amazing what it's able to do: it knows nothing about the conversation, and then it's just 00:07:23.400 |
thrown into like a basketball game and has to transcribe everything, you know, quickly and with just like a few 00:07:29.800 |
seconds of context, and then do a really good job. What happens when you give that model the entire 00:07:34.920 |
context of the conversation up until that point? It gets way more accurate. But it's not just about the 00:07:39.880 |
accuracy, because the next step in that is, once you can pass that context along, 00:07:46.120 |
you can pass the original input context along, but you can also have your speech to text model output 00:07:50.760 |
context as well, and that can output text that is human readable, it can output audio, it could output images, it 00:07:56.600 |
could output just embeddings, which people are now familiar with, like just a vector embedding. 00:08:03.240 |
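
[Editor's sketch] The idea described above, a speech to text stage that takes audio plus context and returns text plus updated context, could look roughly like the following. This is a minimal illustration under stated assumptions, not Deepgram's API: the `Context` class, the `speech_to_text` function, and the toy "embedding" are all hypothetical names invented here, and the stub pretends the audio bytes are already UTF-8 text instead of running a real model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Context:
    """Hypothetical conversation state handed from one turn to the next."""
    transcript: Tuple[str, ...] = ()   # human-readable text from earlier turns
    embedding: Tuple[float, ...] = ()  # stand-in for an opaque vector embedding


def speech_to_text(audio: bytes, context: Context) -> Tuple[str, Context]:
    """Toy contextual STT: audio plus context in, text plus new context out.

    A real model would decode audio; this stub treats the bytes as UTF-8 text.
    """
    text = audio.decode("utf-8")
    # Toy "embedding": total words heard so far and number of turns so far.
    words = sum(len(t.split()) for t in context.transcript) + len(text.split())
    new_context = Context(
        transcript=context.transcript + (text,),
        embedding=(float(words), float(len(context.transcript) + 1)),
    )
    return text, new_context


# Each turn receives the context produced by the previous turn.
context = Context()
for chunk in (b"hi i need to reset my password", b"yes the one for my savings account"):
    text, context = speech_to_text(chunk, context)

print(context.transcript)
print(context.embedding)
```

The point of the sketch is only the shape of the interface: the stage returns a context object alongside its text, so a later turn (or a downstream stage) can be handed everything heard so far rather than starting cold.
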
And so that can carry the state of the conversation throughout the entire thing. I just want to point 00:08:08.280 |
out this is not how systems are built right now, but in the next year this is how they're going to be 00:08:11.960 |
built, and this is when things are going to flip into like, holy shit, this feels like human. 00:08:18.360 |
Because once that text to text is contextual from the audio, it knows it's hearing an angry 00:08:24.360 |
person or a happy person, it knows that the conversation is flowing quickly or slowly, it knows 00:08:30.040 |
that there's, you know, light music playing in the background, it knows all this stuff, right? And so that 00:08:34.600 |
text to text model can now generate the appropriate response. But it's not just about the text that it 00:08:39.000 |
generates; it generates its own context, right? So we can say to the text to speech model, hey, I need you to 00:08:44.440 |
say this softly, I need you to say it slowly (you know, I'm speaking very quickly right now, right, but 00:08:49.880 |
I need you to say it slowly), I need you to say it in an authoritative tone, you know, that kind of thing. And 00:08:56.120 |
then when the text to speech model generates that audio, it will say, I tried to generate it this way, I 00:09:02.680 |
sounded like this, I think I did a good job, etc. That's context that's going to get passed to the next turn in 00:09:08.200 |
the conversation. And so all this context is going to be passed around. We call it contextual AI internally 00:09:13.320 |
at Deepgram, but this is what the next generation of these models is going to look like, and this is 00:09:17.640 |
actually the innovation that's going to make it feel like a human, because the speed part is taken care 00:09:22.520 |
of, the accuracy part is taken care of; now it's all about context. I know there's, to preempt any 00:09:28.600 |
follow-up questions about like a multimodal model or a speech-to-speech model: sure, absolutely, you may meld 00:09:37.640 |
some of these together, you may put them all together. The problem with putting them all together 00:09:41.880 |
is it's not as controllable. So we are the largest speech-to-text API in the world now, but that's 00:09:49.400 |
mostly because of businesses using us to power like Spotify, to power food ordering, to power call centers 00:09:57.800 |
and that kind of thing, and they need controllability in what they're doing. So if you just give like an 00:10:04.360 |
open-ended prompt to a speech-to-speech model and just say like, go to town, that's not the kind of 00:10:09.160 |
experience that like a bank wants, you know; they want a little more control. They want to maybe put a 00:10:14.760 |
whole bunch of compute power in the speech-to-text to make sure that they get everything precisely 00:10:19.480 |
right, and then the text-to-text doesn't actually have to be all that big, because they're just doing 00:10:23.720 |
a few things, like helping them reset their password or something like that, and 00:10:28.920 |
then they want the text-to-speech to be like really expressive but calm, and only a single voice. 00:10:33.320 |
And so these are all going to be kind of compartmentalized, because that brings us to the COGS 00:10:37.880 |
conversation, the cost of goods sold. Right now a lot of people probably feel that, you know, AI is kind 00:10:43.000 |
of expensive, but it doesn't have to be if you use the right tools and you use the right services that 00:10:48.040 |
focus on cost of goods sold, and if you choose the right size for each component in the 00:10:54.920 |
stack. So in the future that's what it's going to look like. I'm trying to hurry through 00:11:02.840 |
this because I want to leave room for questions; it's already been like 14 minutes, or maybe 12 minutes 00:11:08.360 |
or so, but I'll give you just a flash of like what the future will look like. But one thing that we're 00:11:14.920 |
doing as, you know, a platform at Deepgram: we think, hey, if you're doing anything in 00:11:21.240 |
audio, you should be thinking about Deepgram. Maybe you don't use this for every piece of it, but you 00:11:24.760 |
should be thinking, hey, if I want low latency real-time speech to text, you should definitely be thinking 00:11:29.880 |
about Deepgram. If you want to be using low latency text to speech, definitely at 00:11:36.440 |
least talk to us. That's a new product for us, but, you know, the next version will be even more 00:11:40.920 |
expressive, but right now it's, I would say, better than like Amazon, Microsoft, etc., than their neural 00:11:46.840 |
models, not quite as good as ElevenLabs. But anyway, the next product coming out for us, which 00:11:53.720 |
I want to give everybody the chance to try out or apply to for our preview, is our voice AI agent, which 00:12:02.280 |
is a full stack where we put everything together. So if you want, you can use your own API 00:12:08.040 |
keys and use your own LLMs, but also you could have it all put together with Deepgram, and this 00:12:13.960 |
helps with reducing that latency, so you get your turn taking down to a very short, you know, 300 00:12:20.360 |
milliseconds, 500 milliseconds, 600 milliseconds, rather than like 800 or 1500 or something if you tried to 00:12:25.640 |
piece them together yourself. And if you want to get access to this voice AI agent API, we have a QR code 00:12:32.680 |
here, and we have some folks in the back too. If you saw our workshop, I think two days ago, with Damian, 00:12:38.920 |
he gave an awesome workshop. He's in the back; you can talk to him about this. But also just feel free to 00:12:44.840 |
screenshot this or go to the link now or whatever it is. Also, at Deepgram we give out 250