Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram


Whisper Transcript

00:00:00.000 | Hey everybody. If you don't know about Deepgram, Deepgram is a company that does audio AI. We've
00:00:18.880 | been around for nine years now, so ancient in AI years, but we brought end-to-end deep learning
00:00:28.620 | to speech recognition, and our most recent product is a TTS product that we released back in December.
00:00:37.180 | You can use our API on-prem, you can use it in the cloud, and you can adapt models to a specific
00:00:45.660 | acoustic environment. I just came from a drive-through that has all sorts of crazy things happening in it,
00:00:50.840 | like cars running and Amazon trucks backing up, and we automatically adapt models
00:00:57.240 | for that kind of thing. So anyway, if you don't know about us, that's what we do. We're also a research-
00:01:03.800 | led company, which means we figure out fundamental things through our research team and
00:01:09.880 | then bring them to market as soon as possible through our product team. Everything
00:01:17.480 | is moving at breakneck speed, but a lot of the things folks enjoy now, like really fast, accurate,
00:01:23.880 | real-time speech recognition in devices out there (I was just talking with a
00:01:30.440 | founder a couple minutes ago), most of them use Deepgram behind the scenes. So that's who we are:
00:01:37.560 | we help companies build truly conversational experiences. The bar that we set for ourselves,
00:01:44.440 | and we set it from the beginning, though it takes a long time to get there, is: what can a human do? We ask
00:01:51.000 | this all the time in our product meetings and with our research team, and it helps guide us toward what's
00:01:58.040 | possible, because neural networks really do work in a way that is similar to a human: they can learn by
00:02:04.040 | example. So you have to think about how to formulate a problem in a way that the
00:02:10.600 | machine can actually learn from it, and we spend all of our time thinking that way. So
00:02:15.640 | arguments about how many neurons it has, whether it's equivalent to a mouse or a cat or a person,
00:02:20.120 | we're talking about that kind of thing all the time, and it's amazing to see the progress that
00:02:25.240 | has happened over the last decade. We're at an amazing time right now. One thing to keep in
00:02:31.640 | mind is that there was a previous version of voice AI, call it voice AI 1.0.
00:02:39.720 | Think Siri, think things like that: it was slow, it wasn't super accurate, it didn't have a
00:02:45.000 | real-time feel to it, and you could try to ask it anything, but it would only answer in
00:02:52.440 | a very specific domain. Now, with the next version of voice AI, it's open-ended, and
00:02:59.400 | a lot of what's driving that is LLMs: you can put any text in and, depending on how the model is
00:03:05.240 | trained, get any text out. So if you're having a voice AI conversation, you can take the audio,
00:03:11.400 | turn it into text using speech-to-text software, inject it into an LLM, get text out the other
00:03:16.440 | side, and then use a text-to-speech model to produce audio. Part of this new voice AI
00:03:23.960 | 2.0 is how fast you can do that, and in how smart a way you can do it, so that the agent actually
00:03:30.120 | does what it's supposed to, and then how expressive it sounds. We'll talk about that a little
00:03:34.760 | bit. This is the type of process: speech-to-text into an LLM, which I would just
00:03:41.160 | internally call text-to-text, who cares if it's a transformer model or whatever
00:03:46.200 | it is, but you transform speech to text, then text to text, then text to speech, and you keep that loop
00:03:51.240 | going.
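That loop is easy to sketch. A minimal, runnable sketch with hypothetical stand-in functions (these are placeholders for illustration, not real Deepgram or LLM API calls):

```python
# Cascaded voice AI loop: speech -> text -> text -> speech, once per turn.
# All three stage functions are hypothetical stubs standing in for real services.

def transcribe(audio: bytes) -> str:
    """Speech-to-text: stand-in for a streaming STT service."""
    return "hello, what are your hours?"

def generate(text: str) -> str:
    """Text-to-text: stand-in for an LLM completion."""
    return f"You asked: '{text}'. We're open 9 to 5."

def synthesize(text: str) -> bytes:
    """Text-to-speech: stand-in for a TTS service."""
    return text.encode("utf-8")  # pretend these bytes are audio

def conversation_turn(audio_in: bytes) -> bytes:
    user_text = transcribe(audio_in)   # speech to text
    reply_text = generate(user_text)   # text to text (the LLM step)
    return synthesize(reply_text)      # text to speech

audio_out = conversation_turn(b"\x00\x01")  # fake audio frame in, audio bytes out
```

A production loop would stream audio both ways instead of batching whole turns, but the shape is the same.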
00:03:59.320 | Folks out there might say there's a big hype cycle and we're at the peak of it. I would say:
00:04:04.040 | hang on a second, this is more like 1910, when you see the first cars in the street, something
00:04:10.040 | like that. Really, it's just a tiny thing that's happening right now, and there's going to be a
00:04:14.680 | massive 100x explosion in the next decade for voice AI, text, etc. AI is going to be
00:04:21.640 | everywhere. The way I think about this, and would encourage others to think about it, is:
00:04:26.520 | there was an agricultural revolution that took maybe a thousand or two thousand years to happen,
00:04:31.160 | then an industrial revolution that took maybe 250 or 300 years, then an
00:04:36.120 | information revolution that took about 75 years. You see the trend: the intelligence revolution
00:04:41.800 | is going to take maybe 25 or 30 years. So if you thought tech companies were fast before,
00:04:47.000 | AI companies have to move three times faster. Anyway, this is how it works today: there's
00:04:53.720 | a speech-to-text system, then an LLM, or text-to-text system, and then
00:05:01.960 | a text-to-speech system. Each of these works fairly independently, but the state of the art
00:05:07.960 | works very quickly with high efficacy, or accuracy. Speech-to-text, from even just five or
00:05:16.360 | eight years ago, used to be maybe 75% accuracy or somewhere around there. Now it's
00:05:22.360 | over 90%, and that really makes a big difference. Speech-to-text also used to have maybe a
00:05:28.520 | two-to-five-second delay for real time; now it's more like 100 or 200 milliseconds,
00:05:35.240 | with high accuracy. You can also run it on-prem and co-locate all these services together.
00:05:41.800 | Actually, the founder of Daily, I just saw him roaming around, they just came out with a blog
00:05:46.040 | post showing that you can do the entire voice AI round-trip conversation in less than 500 milliseconds,
00:05:51.800 | using Deepgram for the speech-to-text, Llama for the text-to-text, and Deepgram for the text-to-speech.
00:05:57.640 | Nevertheless, that's how fast a human responds: turn-taking is between about 400 and 600 milliseconds.
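As a back-of-the-envelope check of that budget (all per-stage figures below are illustrative, not measurements):

```python
# Rough latency budget for one voice AI turn, using the ballpark
# figures from the talk. Numbers are illustrative, not measured.

budget_ms = {
    "speech_to_text": 150,  # ~100-200 ms for modern streaming STT
    "text_to_text": 200,    # fast LLM time-to-first-token, co-located
    "text_to_speech": 100,  # streaming TTS time-to-first-audio
    "network": 50,          # small if the services are co-located/on-prem
}

total = sum(budget_ms.values())
print(total)  # 500

# Human turn-taking gaps run roughly 400-600 ms, so a sub-500 ms
# round trip lands inside the window that feels natural.
assert total <= 600
```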
00:06:03.400 | So you can do all that here. But there is a piece that's missing:
00:06:11.400 | these are all, like I just said, speech-to-text models, text-to-text models, text-to-speech models; they're not
00:06:16.280 | passing along any context throughout the conversation. What this ends up with is a few
00:06:23.720 | spots where you go, hang on a second, it's not really getting exactly what I'm saying. Maybe
00:06:30.520 | that only happens in 10% or 20% of the interactions, or, what I really mean is, 10% or 20% of
00:06:37.080 | the turns. But the way you combat that is by adding in context, and that's what
00:06:45.880 | this view is right here. It looks like a really subtle change, but instead of a speech-to-text
00:06:51.560 | model that just takes in audio and puts out text, it will take audio and context. So
00:06:57.400 | think promptable, but not necessarily text-promptable: it could be promptable with anything. It could be promptable with
00:07:02.840 | other audio, it could be promptable with images, it could be promptable with documents, it could be
00:07:08.040 | promptable with the previous turn of the conversation. What that gives the speech-to-text
00:07:13.400 | model is context. When you send something to a speech-to-text model right now,
00:07:19.160 | it's kind of amazing what it's able to do: it knows nothing about the conversation, and then it's just
00:07:23.400 | thrown into, say, a basketball game and has to transcribe everything quickly, with just a few
00:07:29.800 | seconds of context, and do a really good job. What happens when you give that model the entire
00:07:34.920 | context of the conversation up until that point? It gets way more accurate. But it's not just about the
00:07:39.880 | accuracy, because the next step is that once you can pass that context along,
00:07:46.120 | you can pass the original input context along, but you can also have your speech-to-text model output
00:07:50.760 | context as well. It can output text that is human-readable, it can output audio, it could output images, it
00:07:56.600 | could output just embeddings, which people are now familiar with as vector embeddings,
00:08:03.240 | and that can carry the state of the conversation throughout the entire thing.
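One way to picture that context-passing design is each stage taking the running context in and emitting an updated context alongside its main output. This is a sketch of the idea only; nothing here is a shipping API, and every name is hypothetical:

```python
# Each pipeline stage consumes and produces a shared, evolving context.
# All stage functions are stubs illustrating the data flow, not real services.

from dataclasses import dataclass, field

@dataclass
class Context:
    transcript_so_far: str = ""
    detected_mood: str = "neutral"  # e.g. "angry", "happy", "calm"
    background: str = ""            # e.g. "light music playing"
    style_directives: list[str] = field(default_factory=list)

def stt(audio: bytes, ctx: Context) -> tuple[str, Context]:
    # With the prior transcript as a prompt, the model can disambiguate
    # words it would otherwise miss. (Stubbed output here.)
    text = "i'd like to reset my password"
    ctx.transcript_so_far += " " + text
    ctx.detected_mood = "calm"
    return text, ctx

def llm(text: str, ctx: Context) -> tuple[str, Context]:
    # The reply can be conditioned on mood, pacing, background noise...
    reply = "Sure, I can help you reset your password."
    # ...and the LLM can emit style directives for the TTS stage.
    ctx.style_directives = ["speak softly", "speak slowly"]
    return reply, ctx

def tts(text: str, ctx: Context) -> tuple[bytes, Context]:
    audio = text.encode()  # stand-in for synthesized audio
    # The TTS stage could also report how it rendered the reply, so the
    # next turn knows what the agent just sounded like.
    return audio, ctx

ctx = Context()
text, ctx = stt(b"\x00", ctx)
reply, ctx = llm(text, ctx)
audio, ctx = tts(reply, ctx)
```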
00:08:08.280 | I just want to point out: this is not how systems are built right now, but in the next year this is how they're going to be
00:08:11.960 | built, and this is when things are going to flip into "holy shit, this feels like a human."
00:08:18.360 | Because once that text-to-text is contextual from the audio, it knows it's hearing an angry
00:08:24.360 | person or a happy person, it knows the conversation is flowing quickly or slowly, it knows
00:08:30.040 | that there's light music playing in the background. It knows all this stuff, right? So that
00:08:34.600 | text-to-text model can now generate the appropriate response. But it's not just about the text it
00:08:39.000 | generates; it generates its own context, right? So it can say to the text-to-speech model: hey, I need you to
00:08:44.440 | say this softly, I need you to say it slowly (I'm speaking very quickly right now, right, but
00:08:49.880 | I need you to say it slowly), I need you to say it in an authoritative tone, that kind of thing. And
00:08:56.120 | then when the text-to-speech model generates that audio, it will say: I tried to generate it this way, I
00:09:02.680 | sounded like this, I think I did a good job. That's context that gets passed to the next turn in
00:09:08.200 | the conversation. So all this context is going to be passed around. We call it contextual AI internally
00:09:13.320 | at Deepgram, but this is what the next generation of these models is going to look like, and this is
00:09:17.640 | actually the innovation that's going to make it feel like a human, because the speed part is taken care
00:09:22.520 | of, the accuracy part is taken care of; now it's all about context. And to preempt any
00:09:28.600 | follow-up questions about a multimodal model or a speech-to-speech model: sure, absolutely, you may meld
00:09:37.640 | some of these together, you may put them all together. The problem with putting them all together
00:09:41.880 | is that it's not as controllable. We are the largest speech-to-text API in the world now, but that's
00:09:49.400 | mostly because of businesses using us, to power Spotify, to power food ordering, to power call centers,
00:09:57.800 | that kind of thing, and they need controllability in what they're doing. If you just give an
00:10:04.360 | open-ended prompt to a speech-to-speech model and say, go to town, that's not the kind of
00:10:09.160 | experience that a bank wants. They want a little more control. They might want to put a
00:10:14.760 | whole bunch of compute power into the speech-to-text to make sure they get everything precisely
00:10:19.480 | right, and then the text-to-text doesn't actually have to be all that big, because they're just doing
00:10:23.720 | a few things, like helping someone reset their password. And
00:10:28.920 | then they want the text-to-speech to be really expressive, but calm, and only a single voice.
00:10:33.320 | So these are all going to be somewhat compartmentalized, and that brings us to the COGS
00:10:37.880 | conversation, the cost of goods sold. Right now a lot of people probably feel that AI is kind
00:10:43.000 | of expensive, but it doesn't have to be if you use the right tools, and services that
00:10:48.040 | focus on cost of goods sold, and if you choose the right size for each component in the stack.
00:10:54.920 | So in the future, that's what it's going to look like. I'm trying to hurry through
00:11:02.840 | this because I want to leave room for questions; it's already been like 14 minutes, or maybe 12 minutes
00:11:08.360 | or so, but I'll give you just a flash of what the future will look like. One thing that we're
00:11:14.920 | doing as a platform at Deepgram: we think, hey, if you're doing anything in
00:11:21.240 | audio, you should be thinking about Deepgram. Maybe you don't use us for every piece of it, but you
00:11:24.760 | should be thinking: if I want low-latency, real-time speech-to-text, you should definitely be thinking
00:11:29.880 | about Deepgram. If you want to be using low-latency text-to-speech, definitely at
00:11:36.440 | least talk to us. That's a new product for us, and the next version will be even more
00:11:40.920 | expressive, but right now I would say it's better than the neural models from Amazon, Microsoft, etc.,
00:11:46.840 | and not quite as good as ElevenLabs. But anyway, the next product coming out for us, which
00:11:53.720 | I want to give everybody the chance to try out, or apply to for our preview, is our voice AI agent, which
00:12:02.280 | is a full stack where we put everything together. If you want, you can use your own API
00:12:08.040 | keys and your own LLMs, but you could also have it all put together with Deepgram, and this
00:12:13.960 | helps with reducing that latency, so you get your turn-taking down to a very short 300,
00:12:20.360 | 500, or 600 milliseconds, rather than 800 or 1500 or something if you tried to
00:12:25.640 | piece it together yourself. If you want access to this voice AI agent API, we have a QR code
00:12:32.680 | here, and we have some folks in the back too. If you saw our workshop two days ago with Damian:
00:12:38.920 | he gave an awesome workshop, he's in the back, and you can talk to him about this. But also feel free to
00:12:44.840 | screenshot this or go to the link now, or whatever it is. Also, at Deepgram we give out 250
00:12:52.360 | so anybody can try it out. Thanks, everyone.