
Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram


Transcript

Hey everybody. If you don't know about Deepgram, Deepgram is a company that does audio AI. We've been around for nine years now, so ancient in AI years, but we brought end-to-end deep learning to speech recognition, and our most recent product is a TTS product that we released back in December. You can use our API on-prem, you can use it in the cloud, and you can adapt models to a specific acoustic environment.

I just came from a drive-through that has all sorts of crazy things happening in it, like cars running and Amazon trucks backing up, and we automatically adapt models for that kind of environment. So anyway, if you don't know about us, that's what we do.

Also, we're a research-led company. What that means is we figure out fundamental things through our research team and then bring them to market as soon as possible through our product team. Everything is moving at breakneck speed, but a lot of the things folks enjoy now, like really fast, accurate, real-time speech recognition, if you have any of those devices out there.

I was just talking with the founder a couple of minutes ago; most of them use Deepgram behind the scenes. So that's who we are: we help companies build truly conversational experiences. The bar we set for ourselves, and we set it from the beginning even though it takes a long time to get there, is: what can a human do? We ask this all the time in our product meetings and on our research team, and it helps guide us toward what's possible, because neural networks really do work in a way that is similar to a human; they learn by example. So you have to think about how to formulate a problem in a way that the machine can actually learn from it, and we spend all of our time thinking that way. Arguments about how many neurons it has, whether it's equivalent to a mouse or a cat or a person, we're talking about that kind of thing all the time. It's amazing to see the progress that has happened over the last decade, and we're at an amazing time right now.

One thing to keep in mind is that there was a previous version of voice AI, call it voice AI 1.0. Think Siri, think things like that: it was slow, it wasn't super accurate, it didn't have a real-time feel, and you could try to ask it anything, but it would only answer in a very specific domain.

Now, with the next version of voice AI, it's open-ended, and a lot of what's driving that is LLMs. You can put any text into them and, depending on how the model is trained, get any text out. So if you're having a voice AI conversation, you can take the audio, turn it into text using speech-to-text software, inject it into an LLM, get text out the other side, and then use a text-to-speech model to produce audio. Part of voice AI 2.0 is how fast you can do that, how smart a way you can do it so the agent actually does what it's supposed to, and then how expressive it sounds. We'll talk about that a little bit. This type of process goes speech to text, then into an LLM (internally I would just call that text to text; who cares if it's a transformer model or whatever it is), then text to speech, and you keep that loop going.

What we're seeing right now: folks out there might say there's a big hype cycle and we're at the peak of it or something, and I would say, hang on a second, this is more like 1910, when you see the first cars in the street. Really, it's just a tiny thing that's happening right now, and there's going to be a massive 100x explosion in the next decade for voice AI, text, and so on. AI is going to be everywhere.

The way I would think about this, and encourage others to think about it, is: there was an agricultural revolution that took a thousand or two thousand years to happen, then an industrial revolution that took maybe 250 to 300 years, then an information revolution that took about 75 years. You see the trend; the intelligence revolution is going to take 25, maybe 30 years. So if you thought tech companies were fast before, AI companies have to move three times faster.
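To make that loop concrete, here's a minimal sketch of one turn of the speech-to-text, text-to-text, text-to-speech cycle described above. The stage functions (transcribe, generate_reply, synthesize) are placeholders for whatever STT, LLM, and TTS services you plug in, not a specific Deepgram API; the timing just shows where the turn-taking budget gets spent.

```python
import time

# Placeholder stages: swap in real STT, LLM, and TTS services here.
def transcribe(audio: bytes) -> str:
    """Speech-to-text: audio in, text out (stubbed)."""
    return "hypothetical transcript"

def generate_reply(transcript: str, history: list[str]) -> str:
    """Text-to-text: an LLM turns the transcript (plus history) into a reply (stubbed)."""
    return "hypothetical reply"

def synthesize(reply: str) -> bytes:
    """Text-to-speech: reply text in, audio out (stubbed)."""
    return b"hypothetical audio"

def voice_turn(audio: bytes, history: list[str]) -> bytes:
    """One turn of the voice AI 2.0 loop, timing each stage.

    Human turn-taking sits around 400-600 ms, so all three stages
    together need to fit inside that budget to feel conversational.
    """
    t0 = time.perf_counter()
    transcript = transcribe(audio)                 # speech -> text
    reply = generate_reply(transcript, history)    # text -> text
    speech = synthesize(reply)                     # text -> speech
    history.extend([transcript, reply])            # carry the conversation forward
    print(f"turn latency: {(time.perf_counter() - t0) * 1000:.0f} ms")
    return speech
```

In practice the stages stream and overlap rather than running strictly in sequence, which is part of how real systems get the round trip under the human turn-taking window.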
Anyway, this is how it works today. There's a speech-to-text system, you can see it in there, there's an LLM or text-to-text system, and then a text-to-speech system, and each of these systems works fairly independently. But the state of the art works very quickly and with high efficacy, or accuracy. Speech-to-text, even just five or eight years ago, used to be maybe 75% accuracy or somewhere around there; now it's over 90%, and that over-90% really makes a big difference. Speech-to-text also used to have maybe a two-to-five-second delay for real time; now it's like 100 or maybe 200 milliseconds, with the high accuracy. And you can run it on-prem and co-locate all these services together. Actually, the founder of Daily, I just saw him roaming around, they just came out with a blog post showing that you can do the entire voice AI round-trip conversation in less than 500 milliseconds, using Deepgram for the speech-to-text, Llama for the text-to-text, and then Deepgram for the text-to-speech. That's about how fast a human responds: turn-taking sits between roughly 400 and 600 milliseconds. So you can do all of that here.

But there is a piece that's missing. These are all, like I just said, speech-to-text models, text-to-text models, text-to-speech models, and they're not passing along any context throughout the conversation. What this ends up with is a few spots where you're like, hang on a second, it's not really getting exactly what I'm saying. Maybe that only happens in 10% or 20% of the interactions, or what I really mean is 10% or 20% of the turns.

The way you combat that is by adding in context, and that's what this view is right here. It looks like a really subtle change, but instead of a speech-to-text model that just takes in audio and puts out text, it will take in audio and context. So think promptable, but not necessarily text-promptable: it could be promptable with anything. It could be promptable with other audio, with images, with documents, with the previous turn of the conversation. What that gives you is context for the speech-to-text model.

When you send something to a speech-to-text model right now, it's kind of amazing what it's able to do. It knows nothing about the conversation, it's just thrown into a basketball game, so to speak, and has to transcribe everything quickly, with just a few seconds of context, and do a really good job. What happens when you give that model the entire context of the conversation up to that point? It gets way more accurate.

But it's not just about the accuracy, because the next step is that you can pass that context along. You can pass the original input context along, but you can also have your speech-to-text model output context as well. It can output text that is human-readable, it can output audio, it can output images, it can output just embeddings, which people are now familiar with as vector embeddings. And that can carry the state of the conversation throughout the entire thing.

I just want to point out that this is not how systems are built right now, but in the next year this is how they're going to be built, and this is when things are going to flip into "holy shit, this feels like a human."
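As a rough illustration of that context passing, here's a sketch of a context object that each stage could read and enrich as the turn goes by. The structure and field names are assumptions made for illustration, not how Deepgram or anyone else actually implements contextual AI.

```python
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    """Context handed between the speech-to-text, text-to-text, and text-to-speech stages.

    The fields are illustrative; in the talk's framing they could just as well
    be raw audio, images, documents, or nothing but embeddings.
    """
    transcript_history: list[str] = field(default_factory=list)     # previous turns, human-readable
    audio_cues: dict[str, str] = field(default_factory=dict)        # e.g. {"emotion": "angry", "background": "light music"}
    style_directives: dict[str, str] = field(default_factory=dict)  # e.g. {"tone": "authoritative", "pace": "slow"}
    state_embedding: list[float] = field(default_factory=list)      # a vector carrying conversation state

def transcribe_with_context(audio: bytes, ctx: TurnContext) -> tuple[str, TurnContext]:
    """A contextual STT stage: it takes audio *and* context, and returns text *and* enriched context."""
    transcript = "hypothetical transcript"   # placeholder for a real model call
    ctx.transcript_history.append(transcript)
    ctx.audio_cues["emotion"] = "neutral"    # placeholder: a real model would infer this from the audio itself
    return transcript, ctx
```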
Once that text-to-text model is contextual from the audio, it knows it's hearing an angry person or a happy person, it knows the conversation is flowing quickly or slowly, it knows there's light music playing in the background, it knows all this stuff. So that text-to-text model can now generate the appropriate response. And it's not just about the text that it generates; it generates its own context too. We can say to the text-to-speech model: hey, I need you to say this softly, I need you to say it slowly (I'm speaking very quickly right now, but I need you to say it slowly), I need you to say it in an authoritative tone, that kind of thing. Then when the text-to-speech model generates that audio, it will say: I tried to generate it this way, I sounded like this, I think I did a good job. That's context that gets passed to the next turn in the conversation. So all this context is going to be passed around. We call it contextual AI internally at Deepgram, but this is what the next generation of these models is going to look like, and this is actually the innovation that's going to make it feel like a human, because the speed part is taken care of, the accuracy part is taken care of; now it's all about context.

To preempt any follow-up questions about a multimodal model or a speech-to-speech model: sure, absolutely, you may meld some of these together, you may put them all together. The problem with putting them all together is that it's not as controllable. We are the largest speech-to-text API in the world now, but that's mostly because of businesses using us, to power Spotify, to power food ordering, to power call centers, that kind of thing, and they need controllability in what they're doing. If you just give an open-ended prompt to a speech-to-speech model and say, go to town, that's not the kind of experience a bank wants. They want a little more control. They may want to put a whole bunch of compute power into the speech-to-text to make sure they get everything precisely right; then the text-to-text doesn't actually have to be all that big, because it's only doing a few things, like helping someone reset their password; and then they want the text-to-speech to be really expressive but calm, and only a single voice. So these are all going to be kind of compartmentalized.

That brings us to the COGS conversation, the cost of goods sold. Right now a lot of people probably feel that AI is kind of expensive, but it doesn't have to be if you use the right tools, if you use services that focus on cost of goods sold, and if you choose the right size for each component in the stack. So in the future, that's what it's going to look like.

I'm trying to hurry through this because I want to leave room for questions; it's already been like 14 minutes, or maybe 12, but I'll give you just a flash of what the future will look like. One thing that we're doing as a platform at Deepgram: we think, hey, if you're doing anything in audio, you should be thinking about Deepgram. Maybe you don't use us for every piece of it, but if you want low-latency, real-time speech-to-text, you should definitely be thinking about Deepgram, and if you want low-latency text-to-speech, definitely at least talk to us.
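One way to picture that compartmentalization and right-sizing is a per-component configuration like the hypothetical one below, loosely following the bank example: heavy speech-to-text, a small task-specific text-to-text model, and an expressive but calm single-voice text-to-speech. The option names are invented for illustration and don't correspond to any real vendor's config schema.

```python
# Hypothetical per-component sizing for a bank's voice agent: spend compute
# where precision matters most and keep the rest small to control COGS.
bank_voice_agent_config = {
    "speech_to_text": {
        "model_size": "large",       # heavy model so every word is captured precisely
        "deployment": "on_prem",     # co-located to keep round-trip latency low
    },
    "text_to_text": {
        "model_size": "small",       # only needs to handle a few narrow tasks
        "allowed_tasks": ["reset_password", "check_balance"],
    },
    "text_to_speech": {
        "voice": "single_branded_voice",
        "style": {"tone": "calm", "expressiveness": "high", "pace": "measured"},
    },
}
```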
That's a new product for us, and the next version will be even more expressive, but right now I would say it's better than the neural models from Amazon, Microsoft, and so on, though not quite as good as ElevenLabs. Anyway, the next product coming out for us, which I want to give everybody the chance to try out or apply to for our preview, is our voice AI agent. It's a full stack where we put everything together. If you want, you can use your own API keys and your own LLMs, but you can also have it all put together with Deepgram, and that helps with reducing latency: you get your turn-taking down to a very short 300, 500, 600 milliseconds, rather than something like 800 or 1,500 if you tried to piece it together yourself. If you want access to this voice AI agent API, we have a QR code here, and we have some folks in the back too. If you saw our workshop, I think two days ago, with Damian, he gave an awesome workshop; he's in the back, and you can talk to him about this. But also, feel free to screenshot this or go to the link now. Also, at Deepgram we give out $250 in credits so anybody can try it out. Thanks, everyone.