
Building an AI assistant that makes phone calls [Convex Workshop]



00:00:00.000 | Well, hello everybody. Thank you for braving the early morning. 8 a.m. is rough for the best of us,
00:00:20.720 | so I really do appreciate the crawling out of bed and coming to check out this talk. My name is Tom
00:00:28.460 | Redmond. I am the head of DX at a company called Convex. We're building a platform
00:00:36.680 | that people are able to build their companies on from day one to year two,
00:00:42.860 | hopefully year 10 and year 20 and beyond. Today I wanted to walk through this idea
00:00:52.140 | that I've had for a long time about building a better AI assistant. It came
00:00:57.980 | to me a little while ago after trying a number of AI assistants, most of which can
00:01:05.360 | pretty much set calendar events and reminders and timers and things like that.
00:01:10.460 | And I thought to myself, what does a real personal assistant do for people?
00:01:18.680 | A lot of the time, short of collecting laundry and doing physical things, they're on the phone
00:01:26.900 | and they're on email. And so I realized I feel like there's enough technology out there these days
00:01:32.120 | that we could actually string together a number of platforms such that you could have an AI assistant
00:01:38.840 | that knows about you. It has context on you and your life and who you are and would be able
00:01:46.220 | to manage a conversation in a non-creepy way with another human being. And the technology exists such that we can do all of this in real time.
00:01:58.840 | We can transcribe speech to text in real time. We can convert text to speech in nearly real time as well.
00:02:12.600 | And so I kind of wanted to piece all these things together and that's what we're going to go through today.
00:02:16.260 | So this is a better AI assistant. This is Floyd.
00:02:20.640 | This is actually Lloyd from Entourage, but when I was thinking about the name I was like "Who's the best personal assistant of all time?"
00:02:30.180 | And it's Lloyd from Entourage, but I forgot his name and I thought it was Floyd.
00:02:34.200 | And so Floyd is the name of the app and Lloyd is the name of the personal assistant from Entourage.
00:02:40.720 | So there is the repo available for this.
00:02:45.380 | We're going to walk through a demo. We're going to walk through some of the code.
00:02:50.100 | It's okay if you don't get it totally up and running right now.
00:02:54.260 | There are a number of third-party platforms that we are going to string together to make this work.
00:02:59.220 | And so for it to work for you end-to-end, we need Google Cloud with the speech-to-text API enabled,
00:03:07.140 | a Convex account, an OpenAI key, and a Twilio account.
00:03:13.220 | If you have all those things, wonderful.
00:03:15.940 | If not, I'm more than happy to help get anybody set up and see if we can get Floyd working for you on your machine after the talk.
00:03:25.940 | So this is what the .env file is going to look like.
00:03:28.660 | We've got the OpenAI stuff, Twilio stuff,
00:03:30.660 | some Convex things, and some Google stuff.
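As a sketch, a .env along those lines would cover the four groups of credentials. The variable names here follow each platform's common conventions and are assumptions; check the repo's README for the exact keys it expects.

```shell
# Hypothetical .env for Floyd; names are illustrative, not the repo's actual keys.
OPENAI_API_KEY=sk-...

TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=...
TWILIO_PHONE_NUMBER=+1...

CONVEX_DEPLOYMENT=dev:...
CONVEX_URL=https://...convex.cloud

# Path to a Google Cloud service-account key with the Speech-to-Text API enabled
GOOGLE_APPLICATION_CREDENTIALS=./google-credentials.json
```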
00:03:35.380 | Yeah?
00:03:37.380 | It's not, but it will be.
00:03:41.380 | I'm not sure.
00:03:44.260 | I will send it.
00:03:45.380 | I'll make sure that it's available after the talk for sure.
00:03:49.380 | So a little bit of the higher-level architecture here.
00:04:00.660 | This is, at a high level, how Floyd works.
00:04:08.100 | So you'll have an application, either like your phone or a web app.
00:04:12.820 | It will take a voice request from the person, you, who needs help.
00:04:18.660 | It'll take a voice request.
00:04:20.820 | It'll transcribe that in real time by streaming to Google Cloud, which will stream back the transcription.
00:04:29.540 | So you could say something like,
00:04:32.420 | "When's my next dentist appointment?"
00:04:37.540 | Or, "Book me a dentist appointment," or something like that.
00:04:39.540 | It'll make that transcription.
00:04:42.900 | And then what happens is that the client will simply save that request, the user who made it,
00:04:49.620 | and that request into the Convex database, and then hands off.
00:04:55.060 | What the server is doing is listening for changes to new requests in real time from that same Convex database.
00:05:07.540 | It's not polling.
00:05:09.620 | We're not pinging the server to let it know.
00:05:11.620 | It's got a reactive query that it's effectively subscribed to.
00:05:18.420 | And so when a new request comes in,
00:05:19.940 | the server is able to simply pick that up and start working on it and provide status updates along the way,
00:05:28.740 | which, on the other hand, the client is able to then present back to the user.
00:05:35.060 | So the request goes in, the server picks it up, and maybe sets it to
00:05:39.060 | in progress.
00:05:40.420 | That status in progress is, again, picked up for free automatically in real time by the client,
00:05:48.500 | which is also subscribed to that database.
00:05:50.820 | And we'll see what that looks like.
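Conceptually, that subscription can be modeled as a tiny push-based cell: the client registers a callback once and every write is pushed to it, with no polling. This is only a toy stand-in for Convex's reactive queries, not how Convex is implemented.

```typescript
// Toy model of a reactive subscription; illustrative only.
type Listener<T> = (value: T) => void;

class ReactiveCell<T> {
  private listeners: Listener<T>[] = [];
  constructor(private value: T) {}

  // Client side: subscribe once, get the current value immediately,
  // then get pushed every subsequent change.
  subscribe(fn: Listener<T>): void {
    this.listeners.push(fn);
    fn(this.value);
  }

  // Server side: write a new status; all subscribers are notified.
  set(next: T): void {
    this.value = next;
    for (const fn of this.listeners) fn(next);
  }
}

const status = new ReactiveCell<string>("pending");
const seen: string[] = [];
status.subscribe((s) => seen.push(s));
status.set("in progress");
console.log(seen); // ["pending", "in progress"]
```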
00:05:53.380 | So we save the request on the server.
00:05:56.020 | The server will go and look up what it knows about the person who made the request.
00:06:04.420 | This is a platform that, over time, becomes more and more knowledgeable about you.
00:06:13.940 | Every time you make a request, or you ask it something, or you give it information,
00:06:17.780 | or you connect your email, etc., it will start to learn things about you.
00:06:22.900 | What kind of car you drive.
00:06:23.940 | Where your last mechanic shop was.
00:06:26.260 | What school your kids go to.
00:06:27.700 | You can provide as much or as little of this as you want.
00:06:31.460 | You can wait for Floyd to prompt you to ask these questions.
00:06:34.660 | But this is the type of information it's going to need to know to, say,
00:06:38.100 | call the school and let them know that your kid's going to be late.
00:06:40.340 | Okay?
00:06:41.940 | So the server is going to take what it knows about the person who made the request,
00:06:47.700 | and it's going to save that context as a moment in time onto that request.
00:06:54.580 | So now we have this request object in the database.
00:06:57.780 | It's got the person who made it.
00:06:59.940 | It's got the request itself.
00:07:01.140 | It's got some context.
00:07:02.660 | And basically, it's ready to go.
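That snapshotting step might look something like this in plain TypeScript. The field names are invented for illustration, not Floyd's actual schema.

```typescript
// Hypothetical shapes for a user's accumulated context and a prepared request.
type UserContext = Record<string, string>;

interface PreparedRequest {
  userId: string;
  text: string;
  // A frozen copy of the relevant context at the moment the request was made,
  // so later changes to the user's profile don't alter this request.
  contextSnapshot: UserContext;
}

function prepareRequest(
  userId: string,
  text: string,
  fullContext: UserContext,
  relevantKeys: string[],
): PreparedRequest {
  const contextSnapshot: UserContext = {};
  for (const key of relevantKeys) {
    if (key in fullContext) contextSnapshot[key] = fullContext[key];
  }
  return { userId, text, contextSnapshot };
}

const req = prepareRequest(
  "tom",
  "Call the school, Mara is staying home sick",
  { school: "Mara's school", car: "a hatchback", mechanic: "Main St Auto" },
  ["school"],
);
console.log(req.contextSnapshot); // { school: "Mara's school" }
```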
00:07:04.100 | So the server will then take that, and it'll work with OpenAI,
00:07:08.980 | like the ChatGPT integration, GPT-4o in this case,
00:07:12.580 | to effectively provide ChatGPT with the request
00:07:20.420 | and with the context that it needs to fulfill that request
00:07:26.740 | and say, "This is your job now.
00:07:30.740 | Are you ready to help us out?"
00:07:32.180 | At that point, OpenAI is like, "Yeah, we got this.
00:07:38.100 | I think I know what to do."
00:07:40.260 | And it's like, "Great.
00:07:40.980 | Okay.
00:07:41.300 | We're going to make a phone call right now.
00:07:43.060 | The next thing is going to be somebody on the phone,
00:07:46.180 | and you're talking to them."
00:07:48.020 | Okay?
00:07:48.420 | So at that point, the server now has this great starting point
00:07:53.220 | with the help of OpenAI to make the phone call.
00:07:58.740 | So there's a number of ways it can find the phone number for whatever the request might be.
00:08:04.500 | Typically, that would exist in your prior context.
00:08:08.580 | If not, now this doesn't exist yet, but if not, the idea would be,
00:08:13.540 | we understand the request.
00:08:15.940 | We know the general area where you live.
00:08:18.820 | We would do the work to look up the phone number if, for example,
00:08:22.340 | it wasn't already existing in your context.
00:08:24.820 | And if we can't find it, at that point, we could send you a text or something like that.
00:08:30.980 | Floyd would send the text and say, "Hey, do you have a preferred vendor?"
00:08:35.380 | We couldn't find a mechanic in your history.
00:08:37.140 | We couldn't find any online.
00:08:38.900 | Is there anybody that you would like to use?
00:08:41.060 | So at that point, the server will make the phone call.
00:08:47.220 | And it does that by coordinating the conversation through GPT-4,
00:08:56.420 | using OpenAI's text-to-speech, streaming that through Twilio.
00:09:04.180 | And then as the person on the other side of the call is speaking,
00:09:09.780 | we're streaming that audio to be transcribed in real time through Google Cloud,
00:09:14.820 | which we then feed back into GPT-4 to carry on the conversation.
00:09:18.820 | So we're streaming this as fast as possible.
00:09:24.020 | I am super impressed by the technology that we have access to today.
00:09:30.660 | The fact that this works and isn't painfully slow, it boggles my mind.
00:09:38.900 | We're very fortunate to operate in the ecosystem that we do.
00:09:43.700 | So this loop that you're seeing right here,
00:09:48.340 | this happens over and over until the conversation is complete.
00:09:52.500 | So the audio bytes come back and so on and so on.
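The loop itself can be sketched with each external service replaced by a stand-in function, just to show the shape of one turn. In the real app these stand-ins are Google Cloud STT, GPT-4, OpenAI text-to-speech, and Twilio streaming; everything here is illustrative.

```typescript
// Stand-ins for the external services in the loop.
type Services = {
  transcribe: (audioChunk: string) => string; // Google Cloud STT stand-in
  reply: (heard: string) => string;           // GPT-4 stand-in
  speak: (text: string) => string;            // OpenAI TTS stand-in
};

// One iteration per incoming audio chunk until the conversation is complete,
// collecting the transcript entries that get saved along the way.
function runConversation(incomingAudio: string[], s: Services): string[] {
  const transcript: string[] = [];
  for (const chunk of incomingAudio) {
    const heard = s.transcribe(chunk);  // their audio -> text
    transcript.push(`them: ${heard}`);
    const answer = s.reply(heard);      // text -> what Floyd says next
    transcript.push(`floyd: ${answer}`);
    s.speak(answer);                    // text -> audio streamed down the call
  }
  return transcript;
}

const log = runConversation(["<audio 1>", "<audio 2>"], {
  transcribe: (a) => `heard ${a}`,
  reply: (h) => `reply to "${h}"`,
  speak: (t) => `<tts of ${t}>`,
});
console.log(log.length); // 4
```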
00:09:55.780 | Yeah?
00:09:56.340 | Experimenting with multimodal Gemini that you can pass audio?
00:10:02.740 | Not yet, not yet.
00:10:11.620 | Oh yeah, so the question was have I been experimenting yet with the multimodal Gemini
00:10:17.620 | where you can pass it audio and basically skip that step of transcribing, right?
00:10:23.220 | I haven't.
00:10:24.500 | But the beauty of this is that I'm designing it in such a way that
00:10:31.460 | we should be able to swap in and out different services because they're coming out so fast.
00:10:38.980 | You know, when I started this, that didn't exist.
00:10:40.740 | And it's exactly the type of thing I would experiment with and see if I could make it even faster, right?
00:10:47.780 | I think like the faster you make this, there's no limit to how quick this should be when it's operating.
00:10:55.540 | So I think that's an awesome thing to explore.
00:11:00.420 | So as this loop is happening, we are actively saving each part of this transcript as part of that request to the Convex database.
00:11:10.420 | So we're just saving it, we're just pushing it, we're just appending to the database.
00:11:15.460 | What that means is that on the client, because Convex is a reactive database, you get this for free. It's like a one-liner in the client.
00:11:22.900 | Instead of saying useState, you just say useQuery once.
00:11:27.460 | And it will update when the database updates.
00:11:30.100 | So as we are writing the transcript to the database, your client, for example, could be streaming that back in real time.
00:11:45.700 | All right, well, let's dive in.
00:11:46.900 | I feel like, let's see if we can get a demo going first and foremost, why not?
00:11:57.860 | What could possibly go wrong?
00:11:59.300 | Okay, so this is sort of a shell.
00:12:08.340 | It's very much a work in progress.
00:12:09.700 | I wouldn't exactly call this production ready, but it's fun and it does work.
00:12:15.220 | So I'm going to make a voice request to Floyd here, who is hopefully going to see if they can help me out.
00:12:27.460 | Now, for development, I have overwritten every phone number it would call to just call me.
00:12:35.300 | So I pretend like I'm the vendor.
00:12:36.900 | I'm not, I'm not totally confident enough for it to call like a real business and not,
00:12:44.500 | and not totally embarrass me or do something crazy.
00:12:48.420 | So still in development.
00:12:50.580 | So let's see.
00:12:51.220 | Let's see here.
00:12:53.220 | Let's see what we can have Floyd do for us.
00:13:00.420 | Hey Floyd, can you call the school and let them know Mara is going to be home.
00:13:04.500 | She's sick today.
00:13:05.300 | So here we have the request ID.
00:13:12.340 | It'll automatically send that request when you stop speaking.
00:13:15.700 | Down here.
00:13:21.460 | I'm the school.
00:13:22.100 | Hello, this is Mara School.
00:13:25.940 | Hey there, this is Floyd.
00:13:30.580 | I'm calling on behalf of my client, Tom Redmond.
00:13:32.660 | Mara Redmond is staying home today.
00:13:35.460 | She's sick.
00:13:36.020 | Oh no, is she going to be okay?
00:13:39.700 | Yeah, just a bit under the weather right now.
00:13:45.380 | Okay.
00:13:46.740 | That's good.
00:13:47.060 | Do you have any idea when she's going to be back?
00:13:56.900 | She's hoping to be back by the end of the week.
00:13:58.740 | I'll keep you updated if anything changes.
00:14:01.380 | Okay.
00:14:02.820 | That's great.
00:14:03.220 | Thank you.
00:14:09.860 | Maybe I should try that Gemini thing.
00:14:11.060 | Thanks for understanding.
00:14:12.260 | Have a good day.
00:14:12.820 | So again, you can see like it's there.
00:14:18.340 | The latency is so crucial though.
00:14:20.100 | Now this is all in like development land.
00:14:22.820 | Moving this over to production in every one of these platforms would, you know,
00:14:28.260 | that would be the next thing I try.
00:14:29.460 | Make it go faster.
00:14:30.660 | Try different models.
00:14:31.620 | There's different things that I've already done in terms of the audio encoding
00:14:37.140 | to make the streaming as fast as possible.
00:14:39.940 | There's a format called Opus, which is designed for phone-call-level quality
00:14:45.860 | and is encoded to stream really quickly.
00:14:49.540 | So we can see here, here's the request that came in.
00:14:56.020 | And again, all I did in the client was send the request text to the database.
00:15:08.020 | The client is listening for any requests that match this user and is going
00:15:12.740 | to automatically update the app.
00:15:15.460 | And so the first thing I asked OpenAI to do is come up with an action plan.
00:15:19.300 | So we're not saying any of this, but in some ways I have OpenAI kind of prompt itself.
00:15:27.620 | I say, Hey, we're a team here.
00:15:29.860 | We're going to help, um, we're going to help Tom in this case, you know, do something.
00:15:35.380 | They've got a request.
00:15:36.340 | Uh, let's, let's work on this together.
00:15:38.180 | Here's the request.
00:15:38.820 | Here's the context.
00:15:39.940 | And so then this is what OpenAI says.
00:15:41.620 | Based on the context, it'll look up the context on me, what it knows about me.
00:15:46.100 | And it'll say, okay, and pull out all the important information required for that request.
00:15:50.820 | So Mara's full name, we know it's Mara Redmond, school name, reason for absence,
00:15:56.580 | today's date, steps to fulfill the request, dah, dah, dah, dah, dah.
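In the OpenAI chat-message format, that two-step prompting (first draft an action plan, then carry the plan into the call) might be assembled like this. The wording and helper names are invented; only the message shape follows OpenAI's chat API.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Step 1: ask the model to draft an action plan from the request plus context.
function planningMessages(request: string, context: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "We're a team helping a client. Pull out the important information and draft an action plan.",
    },
    { role: "user", content: `Request: ${request}\nContext: ${context}` },
  ];
}

// Step 2: feed the plan back in as context for the live phone call.
function callMessages(plan: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "You are on a live phone call. The next message you receive is the person who answered. Follow this plan:\n" +
        plan,
    },
  ];
}
```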
00:16:00.580 | And then I send this back in as additional context when we're actually making the phone call.
00:16:09.300 | So it's like OpenAI's got these instructions.
00:16:11.380 | It's just sitting with it. Yeah?
00:16:12.500 | I haven't tried, maybe. I certainly haven't done anything specific for that.
00:16:29.140 | But it definitely feels a lot more straightforward to do something like that.
00:16:35.140 | Um, than interacting with a human.
00:16:36.580 | So I wanted to get the thing I didn't know was going to work out of the way.
00:16:42.020 | Um, I feel like navigating a call tree is definitely a solvable problem.
00:16:46.980 | This was like really a curiosity for me to see if we could get this to work.
00:16:51.940 | And you can see down here, I'll see if I can zoom in.
00:16:55.860 | My CSS skills are lacking here.
00:17:01.380 | But there's a transcript here that comes in, uh, in real time as well.
00:17:06.580 | And maybe I can actually do another, uh, let's make another request.
00:17:12.420 | Hey, Floyd, can you order Krista some flowers?
00:17:26.100 | Now we can watch this action plan and everything coming in real time.
00:17:29.620 | Now I'm going to be the flower shop.
00:17:30.900 | So let's break this down.
00:17:33.940 | Okay.
00:17:39.220 | Sweet Violet's flowers.
00:17:43.460 | How can I help you?
00:17:52.820 | Hey there, this is Floyd.
00:17:53.940 | I'm calling on behalf of my client, Tom Redmond.
00:17:56.020 | Um, I need to order some flowers for delivery today.
00:17:59.620 | Uh, okay.
00:18:02.340 | Do you have any idea what kind of flowers you would like?
00:18:05.940 | I know Krista likes your usual offerings.
00:18:13.940 | Could you recommend some popular from your collection?
00:18:16.020 | Yeah, we have some, uh, peonies and some roses for $560.
00:18:20.900 | Would you like that?
00:18:21.540 | That sounds good.
00:18:28.420 | Can you deliver it to an address in Guelph ON today?
00:18:31.540 | Yeah, absolutely.
00:18:33.540 | I think we have, we have, have you been here before?
00:18:36.420 | Do we have your information on file?
00:18:38.020 | I'm calling on behalf of Tom.
00:18:43.940 | Do you need his details again or do you have them on file?
00:18:48.100 | Sorry, I'll have to call you back.
00:18:49.460 | Haha.
00:18:51.060 | That's right.
00:18:52.900 | So I have, uh, this like bail situation built in where it's like, all right, if you're in a pickle
00:19:00.820 | and you just don't have the information that they're asking for, you say, I'm sorry,
00:19:05.540 | I'm going to have to call you back.
00:19:07.460 | And at that point, Floyd's like, I'm missing something.
00:19:12.580 | I need something.
00:19:14.340 | And it would send me a text saying, hey, tried to order you flowers, um,
00:19:19.220 | but you didn't give me a budget and they wanted to charge you $560 for some peonies.
00:19:26.580 | Um, all right.
00:19:27.780 | So what we're looking at here
00:19:32.180 | is the database in Convex that's storing the request information.
00:19:42.740 | So here we have some, uh, requests and you can see, hey, Floyd, can you order Krista some flowers?
00:19:50.660 | Sorry about my CSS skills.
00:19:53.780 | Oh, there we go.
00:19:54.340 | Okay.
00:19:54.580 | That kind of works.
00:19:55.140 | Can you order Krista some flowers?
00:19:57.140 | So let's find that.
00:19:57.860 | Here we go.
00:19:58.420 | Hey, Floyd.
00:19:59.540 | So I can go ahead and just update this straight in the database, uh, some flowers.
00:20:04.820 | I'm going to say there is no budget.
00:20:07.300 | Um, and then we'll just save that.
00:20:11.860 | And then watch what happens in the client when I change the database.
00:20:14.660 | That's it.
00:20:16.900 | That's why when I append parts of that transcript to the database and I'm listening for those changes
00:20:24.180 | in the client, they just appear.
00:20:29.140 | What I also have in here, for users, is this context.
00:20:36.420 | Now I want to preface this by saying the actual code of what I'm doing here
00:20:42.980 | is far from best practice.
00:20:45.140 | So don't take this and try to roll it into production.
00:20:48.580 | This is definitely a proof of concept.
00:20:51.540 | Prior to using Convex for this, I was trying all sorts of things
00:20:58.740 | to get the transcript of that phone call to stream back live to the client.
00:21:09.060 | In fact, I wanted to get it so that you could listen in live from the client.
00:21:16.180 | I was really struggling with that.
00:21:19.540 | I didn't roll my own server to begin with.
00:21:23.940 | I was using Next.js, which is fantastic, but I was hosting it on Vercel,
00:21:30.020 | which doesn't play very nicely with WebSockets.
00:21:35.460 | WebSockets, right?
00:21:36.660 | Most of Vercel's hosting is serverless, and WebSockets are inherently stateful.
00:21:43.220 | So while I was able to get parts of this working with Socket.IO, for example, the first
00:21:52.260 | transcription, having it actually interact with the phone call and streaming that audio with Twilio
00:21:59.780 | through a WebSocket protocol I found to be difficult on the front end.
00:22:07.780 | And so I was trying to do this all in one place.
00:22:11.540 | With Convex, I was really able to separate those concerns and just listen for the things
00:22:22.420 | I wanted, the bits of data I was interested in, and let the server, whatever server it is, manage itself.
00:22:32.020 | In this case, it's an Express server that I've written. I let it do the lifting
00:22:39.300 | and basically just post updates to the database.
00:22:41.620 | So on the client,
00:22:43.860 | we don't have to poll, we don't have to post anything.
00:22:49.460 | It really simplified the separation of concerns around this.
00:22:58.260 | Now, the really interesting thing here is that if you're not using a serverless
00:23:03.620 | hosting infrastructure, you could do this entire thing in your client code base:
00:23:14.020 | when you're coding with Convex, you don't necessarily have to break out your server code from your front
00:23:22.820 | end code. The whole value prop is that you're able to actually build a full server in client land.
00:23:30.260 | It doesn't actually get served on the client. It gets deployed to a Convex server, but you can define your
00:23:37.460 | backend and your APIs and your schemas and your databases all in the same code base as your front end.
00:23:46.260 | And I'll show you exactly what I mean by that. So, let's close this here.
00:23:54.980 | So here I have my web client.
00:24:01.220 | Let's take a look at the page that shows
00:24:10.740 | that list of requests and then the request details.
00:24:13.460 | This line here,
00:24:16.420 | this line fetches requests. That's it.
00:24:23.380 | Anytime a request is updated or changed and it matches the query I've defined
00:24:29.140 | in this Convex function, it'll update my React client however I want.
00:24:37.060 | So let's see, what do I do with this?
00:24:41.540 | I provide a list of fetched requests. So I pass in my requests, which get
00:24:50.340 | updated, into my dashboard. And then from there, it's just an array of
00:24:58.180 | requests. I list them out. It's got the details of the requests that I can use when somebody
00:25:04.020 | clicks on something in the list. I can show those details in the detail pane. And I think
00:25:11.940 | what's interesting here is the way that I've defined the get requests:
00:25:20.740 | it's just this query.
00:25:28.180 | So this query itself right now is not user-specific, but typically you would probably
00:25:34.980 | add in some user ID and some auth. This prototype does not have auth. Again, do not ship
00:25:40.980 | this. So this is the query, get. And if you recall, let me see if I can find it here.
00:25:55.780 | In my page, remember I said api.requests.get. api is a generated, type-safe Convex module that gets
00:26:09.700 | updated and deployed every time you make a change to one of those Convex files.
00:26:14.900 | So api.requests.get. And what that's doing is specifically hitting this function here.
00:26:20.980 | I can name this whatever I want. This just happens to be a get request. And so then this is this
00:26:27.700 | function, this is a query that actually lives not on the client. Even though my code is here in client land,
00:26:40.660 | it doesn't get shipped with the client, it gets built and deployed to Convex. And this function
00:26:48.820 | physically runs on the Convex server, on the same machine as your database.
00:26:58.900 | So there's a custom V8 engine running next to the database that's actually
00:27:06.580 | executing the JavaScript or the TypeScript that you define here. And this makes it extraordinarily fast.
00:27:16.660 | It absolves you from ever having to think about caching. Yeah?
00:27:21.380 | Um, I do, um,
00:27:23.380 | my question would be, "What's the big reason to choose Convex over any of the major existing databases?"
00:27:35.380 | Yeah, that's a good question. So the question is, what's the big compelling reason
00:27:47.940 | to use something like Convex over the other common databases, like Mongo or Postgres?
00:27:57.140 | And the big thing is the developer ergonomics. You don't need a backend and/or infrastructure engineer or team if you're using Convex.
00:28:11.860 | You get type safety all the way through, from your database to your front end,
00:28:22.580 | and you get all the wonderful completions that happen with that as you define your schema.
00:28:28.660 | And you can operate in a much simpler code base. And so again,
00:28:41.940 | what you're seeing here is the totality of my backend server as far as the requests table goes.
00:28:52.500 | And so I've added different things, like get requests by ID, get pending requests, right?
00:28:59.380 | So we can do different filtering. And post request. The post request is interesting because,
00:29:06.100 | again, in front-end land, you can create an HTTP server that just has arbitrary endpoints with
00:29:15.060 | arbitrary responses. And you can use those with or without hitting the database. You can simply
00:29:22.260 | define a GET route or a POST route in your HTTP actions and do with it whatever you want. So in this
00:29:31.620 | instance, what I'm doing is, again, because I need the server here to be able to stream to
00:29:39.860 | Twilio. On any other platform or server that supports WebSockets natively, you can
00:29:46.980 | do it all in one place. So what I've created here is an HTTP post request that posts
00:29:53.860 | the data, the request of the user, to the server, which also happens here.
00:30:04.180 | So here I've added the post request, and this is going to run the action,
00:30:17.940 | which goes ahead and updates the database. Yeah.
00:30:27.620 | Sorry, is the only way to debug this to...? So, note, on the dashboard, you get a
00:30:37.140 | full-access dashboard that includes all of your logs for all of the requests. It
00:30:44.500 | includes the definitions for your functions, and you can actually run your functions from the
00:30:51.460 | dashboard as test functions to see what the response is. Or, yeah. Pardon?
00:31:07.140 | Uh, that's a good question. Um,
00:31:12.820 | Right. So the question is, can you hit breakpoints in the Convex server code? Honestly,
00:31:25.460 | I don't know. Can I get back to you on that? I can definitely follow up.
00:31:30.900 | That's a great question. I've typically ended up relying
00:31:36.180 | on both the logs, and then the Convex
00:31:42.740 | client that is doing the compilation on your machine will also give you any
00:31:53.140 | errors. But that's a great question. Let me get back to you on that one.
00:31:56.580 | Okay. And so here we have... let me try to make this a little bit easier to see.
00:32:03.940 | Toggle. Toggle. Uh,
00:32:08.660 | Okay. Maybe not.
00:32:16.660 | All right. So here we have the request being saved. This is a little bit of an esoteric way to do it,
00:32:28.340 | again, just because of this need to stream. And then at the same time, what I wanted
00:32:33.060 | to do with the stream was write that transcript in real time as it's streaming into the database.
00:32:39.140 | And so what that looks like is, here in the server,
00:32:48.420 | I've created a Convex client right within the server that will effectively get an update every single
00:33:04.020 | time this database changes, again based on the query I define. So in this case,
00:33:10.980 | I'm asking Convex to ping my server anytime there's a change to a pending request. And just by
00:33:20.420 | convention, when somebody makes a request from the client, it's pending by default. The server will
00:33:29.460 | get that; the server will then make sure that it exists. And the first thing it does is change the
00:33:37.220 | status to in flight. And then it will start taking action on that request.
00:33:48.580 | And so we will then get the full request, which has the actual ask of the
00:33:58.660 | client. In this case, we get the session name, and then we do this gather-context step
00:34:04.100 | with OpenAI. This is where this function is reaching into the database and trying to pull
00:34:10.820 | out everything it knows about the user who made this request. Then I update the context.
00:34:16.820 | So I save that. I have this context, and I filter out the parts of it that are required to actually
00:34:22.420 | fulfill this ask. And I save just that subset of the context into the request itself,
00:34:31.300 | just for ease of access. Again, if we sit down and do a design session on this,
00:34:36.820 | there's going to be a lot of changes to make. At that point I say, okay, make the call.
00:34:43.460 | And so the way that Twilio works is that I can use the Twilio
00:34:54.100 | client and I can make a call. And when that call connects, this config tells Twilio where to hit me
00:35:02.500 | back when the status of the call changes, and also where to send any streaming data that's coming in. And so I have a
00:35:09.460 | WebSocket opened up here. I have ngrok running on my server right now so that Twilio can hit it.
00:35:16.340 | So I have a WebSocket server here that I just set up in Express. And that gets hit
00:35:25.060 | with Twilio events, which are things like call initiated, data heard, call ended,
00:35:32.100 | that kind of thing. I have this WebSocket up here. So when I ask in the request to make the call,
00:35:43.300 | I pass the request ID. And when Twilio calls my server back, I've asked it to call my server back
00:35:50.580 | with the request ID. That's how I can pass that data through from Twilio, because
00:35:56.500 | once I send off that call, I'm just waiting. It's gone into the ether, and you hope
00:36:02.820 | that you're going to get the phone call from Twilio. So I needed some way to track that
00:36:07.700 | request ID so that when the call actually connects, I know the context of what the call is all about
00:36:16.500 | and what the request is about. And so I have Twilio send the WebSocket request to this location
00:36:24.500 | with the request ID. And at that point I open up the media stream handler, which grabs the request ID.
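Threading the request ID through Twilio's callback can be sketched as putting it in the callback URL's query string. The host and path here are placeholders (ngrok in development); how the repo actually encodes the ID is not shown in the talk.

```typescript
// Build the callback / media-stream URL Twilio is asked to hit, carrying the
// Convex request ID so the handler can look the request up on connect.
function streamUrlFor(baseUrl: string, requestId: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set("requestId", requestId);
  return url.toString();
}

// And on the receiving end, pull the ID back out of the incoming URL.
function requestIdFrom(incomingUrl: string): string | null {
  return new URL(incomingUrl).searchParams.get("requestId");
}

const u = streamUrlFor("wss://example.ngrok.io/media-stream", "req_123");
console.log(requestIdFrom(u)); // "req_123"
```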
00:36:31.300 | And then it looks it up. And then there are a number of fairly standard
00:36:39.460 | media stream functions in here, things like process message, which gets called a lot,
00:36:46.020 | right, over and over and over. You can configure Twilio to send you different things. You
00:36:52.340 | can have it send you fixed-size chunks, or you can have it send you the audio data after every
00:36:59.300 | utterance or after every pause. And so what I have here is Twilio sending
00:37:07.060 | me the streaming audio data basically after every utterance, which ends at things like
00:37:13.300 | commas and periods. And so when I get that,
00:37:20.020 | I take the audio data, I convert it into, uh, into a file type that Google Cloud is, uh, fast with.
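Twilio media streams deliver base64-encoded 8 kHz G.711 mu-law audio. Google Cloud Speech can accept mu-law directly, but if a conversion to linear PCM is wanted, the G.711 expansion is only a few lines. A generic sketch, not necessarily what the repo does:

```javascript
// Hypothetical sketch: decode one 8-bit G.711 mu-law sample into a
// signed 16-bit linear PCM sample (the format most speech APIs expect).
function mulawToPcm16(muByte) {
  const u = ~muByte & 0xff;          // mu-law bytes are stored inverted
  const sign = u & 0x80;             // top bit is the sign
  const exponent = (u >> 4) & 0x07;  // 3-bit exponent
  const mantissa = u & 0x0f;         // 4-bit mantissa
  // Re-expand: ((mantissa * 8 + 132) << exponent) - 132
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole Twilio media payload (base64 mu-law) into PCM16 samples.
function decodePayload(base64Payload) {
  const bytes = Buffer.from(base64Payload, "base64");
  return Array.from(bytes, mulawToPcm16);
}
```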
00:37:35.940 | I get the transcription back,
00:37:44.100 | and then I ask OpenAI to speak it. Where's the speak? Yeah, so here's the speak function.
00:37:52.660 | I then convert that into a streaming format that Twilio is down with,
00:38:00.740 | which means taking this mu-law file and streaming it as base64
00:38:09.620 | back to Twilio. And throughout this whole time, you can see here a client mutation,
00:38:16.500 | add to transcript. So every time I get a new transcript entry back from Google Cloud during the
00:38:21.860 | stream, I just update the Convex database. That's just happening, and the client is just
00:38:27.460 | subscribed to those changes and can, you know, show them in a list. Or somebody who's better at
00:38:33.300 | UI than I am can make that look really nice and have it scrolling or something.
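On the Twilio side of that loop, audio pushed back down the stream is framed as a JSON "media" event whose payload is base64 mu-law. A minimal sketch of that framing; the streamSid comes from Twilio's start event, and the sample value here is hypothetical:

```javascript
// Hypothetical sketch: frame an outbound mu-law audio chunk as the JSON
// "media" event that Twilio's bidirectional media stream expects.
function buildMediaEvent(streamSid, mulawBuffer) {
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: mulawBuffer.toString("base64") },
  });
}
```

On the live WebSocket this would be sent with something like `ws.send(buildMediaEvent(streamSid, chunk))`.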
00:38:38.340 | So here's this introduce-yourself. I actually have a prerecorded, generated audio file
00:38:49.860 | that is something like, "Hi, my name is Floyd. I'm calling on behalf of my client."
00:38:57.620 | That's prerecorded because, and I'm not sure if it was doing it today, actually,
00:39:02.500 | because I didn't hear it, but I realized the first interaction was the longest. Once somebody
00:39:11.780 | picks up the phone, everything kicks into gear, and that's where there's this
00:39:16.260 | built-up latency. So somebody picks up the phone, they go, "Hello, Brock Road Garage,"
00:39:22.420 | and then... well, you would have heard it, except I don't think it's working. So what I did was have
00:39:29.460 | this prerecorded, pre-generated audio file, and as soon as somebody picks up and says hello,
00:39:34.980 | I just play it, so I don't have to do any transcribing or any text-to-speech or anything.
00:39:42.420 | And while I'm streaming that "Hi, this is Floyd," buying myself some time, that's when I'm actually
00:39:49.780 | triggering the first loop of the conversation and all the transcriptions and the text-to-speech
00:39:58.740 | back. And usually I've bought myself enough time that it's a fairly natural result, something like, "Hi,
00:40:05.860 | this is Floyd. I'm calling on behalf of my client. I need to book a car in for an oil change."
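The trick is plain concurrency: fire the first model round trip immediately, and let the canned greeting play while it's in flight. A hypothetical sketch of that overlap; the function names are mine, not the repo's:

```javascript
// Hypothetical sketch: start the first LLM round trip immediately and
// play the pre-generated greeting while it is in flight, so the greeting
// masks the pipeline's startup latency.
async function answerCall(playGreeting, startFirstTurn) {
  const firstTurn = startFirstTurn(); // fire the first loop, don't await yet
  await playGreeting();               // "Hi, my name is Floyd..." buys time
  return firstTurn;                   // often already resolved by now
}
```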
00:40:12.180 | And I found that there are a number of little tricks you can do to make the experience
00:40:20.980 | better for the person on the other end of the phone. So the first thing I do is say, hi, I'm an AI;
00:40:29.460 | you're talking to an AI right now. I don't want to misrepresent what this is. And the reason I feel
00:40:41.220 | good about a platform like this is that in this context, and the way that it's positioned,
00:40:49.300 | it's almost always buying a service from a business, or otherwise making a benign change like
00:40:58.900 | canceling an appointment. It's never selling anything. And my bet is that business owners
00:41:07.300 | are not going to care that they're talking to an AI if you're buying something from them.
00:41:14.900 | And if that appointment you're booking is legitimate, they're going to be okay with it.
00:41:19.940 | In fact, they might prefer it. They may even start to internalize and train themselves
00:41:26.740 | on how to speak with an AI agent on the other end to be super efficient.
00:41:30.900 | Exactly.
00:41:35.380 | Yeah. So again, don't put this in production, because they could just say,
00:41:43.140 | "forget everything," you know, which I haven't guarded against here. So, you know, full
00:41:49.700 | transparency. But yeah, I think actually they could: a couple of Floyds talking to each other
00:41:55.540 | to ultimately book the appointment, right? At what point does it just become APIs talking to each other?
00:42:01.540 | Right. Full circle.
00:42:06.820 | Not right now. This has been shaky enough. Yeah.
00:42:21.380 | Yes. Yeah, for sure. So on the Convex side, there is no heating up, because
00:42:40.420 | Convex is built foundationally on WebSockets. You have your own deployment server that's
00:42:47.460 | always running. It can scale indefinitely; that's part of what the offer is.
00:42:54.100 | So warm-up time is not an issue on the Convex side. As for those delays,
00:43:01.860 | I have some benchmarking here, some timing, so I can see how long it took
00:43:06.820 | for me to transcribe this piece of text, how long it took to send that to OpenAI and get
00:43:14.660 | the text conversation back, and then how long it takes to turn its response back into audio.
00:43:21.540 | So I could see where... I have it in here; it's in the terminal here somewhere.
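That per-stage timing can be as simple as a wrapper around each async stage. A generic sketch, not the repo's actual instrumentation:

```javascript
// Hypothetical sketch: time each pipeline stage (transcribe, chat, TTS)
// so the slowest link in the chain shows up in the logs.
async function timed(label, fn) {
  const start = Date.now();
  const result = await fn();   // run the stage
  const ms = Date.now() - start;
  console.log(`${label}: ${ms}ms`);
  return { result, ms };
}
```

Usage would look like `const { result } = await timed("transcribe", () => transcribe(chunk))` for each stage in turn.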
00:43:27.220 | But I could see where the latency was. And what I've seen is that often, as the conversation
00:43:35.300 | grows, the prompt I'm sending to OpenAI, which includes all of the previous conversation,
00:43:42.580 | takes longer and longer. And so the latency and the delays on average tend to get
00:43:49.540 | worse the longer the conversation goes. Now, I originally built that functionality before
00:43:56.180 | the OpenAI threads API was available; that would be something that I would try.
00:44:05.540 | I would also work diligently to minimize every prompt I'm sending to OpenAI.
00:44:14.420 | That, I think, would have a really big impact. The other thing that can be
00:44:23.300 | slow: if you send a large piece of text to the text-to-speech on OpenAI, that can be slow.
00:44:32.500 | Even if it's just three sentences, it can still take two, three, four, five seconds.
00:44:39.300 | And there are a couple of parameters you can tweak with the OpenAI text-to-speech stuff,
00:44:48.580 | but not a lot. So what I would do to fix that is pay more for lower latency,
00:44:58.820 | or use another service that I could pay for that would give me lower latency.
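One concrete way to minimize every prompt is a sliding window over the history: always keep the system prompt, then only the most recent turns that fit a budget. A hypothetical sketch, character-based for simplicity (a real version would count tokens):

```javascript
// Hypothetical sketch: keep the system prompt plus only the most recent
// turns that fit a rough character budget, so per-turn prompts stop
// growing with the length of the call.
function trimHistory(messages, maxChars) {
  const [system, ...rest] = messages; // assume messages[0] is the system prompt
  const kept = [];
  let used = 0;
  // Walk backwards from the newest turn, keeping turns until the budget runs out.
  for (let i = rest.length - 1; i >= 0; i--) {
    const len = rest[i].content.length;
    if (used + len > maxChars) break;
    kept.unshift(rest[i]);
    used += len;
  }
  return [system, ...kept];
}
```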
00:45:05.540 | The way this is now, in those phone calls, what you heard and what you experienced,
00:45:13.060 | those delays, there are still optimization opportunities, like crazy, to bring that all down.
00:45:24.100 | I'm actually not even concerned with that right now, because there are still half a dozen material things
00:45:29.460 | I haven't tried to close that gap. I was just happy to have a conversation with a computer
00:45:34.020 | that I could ask questions to. But yeah, the latency comes from
00:45:38.900 | the sum of all of the different interactions that are happening, and so if you
00:45:44.580 | speed up any one of those along the chain, it's going to get faster and faster.
00:45:49.380 | And second to that point, every couple of months there's
00:45:56.020 | this massive improvement in some API in this stack. And so again, the bet is:
00:46:03.620 | I can get it as close as possible right now, and I know in six months it'll be twice as fast
00:46:08.180 | without me doing anything. The rate of innovation and the rate of change and competition
00:46:13.140 | right now for this type of thing is so high. That's a bet that I'm
00:46:20.260 | taking on: maybe it's not perfect right now, but it inevitably will be very soon.
00:46:25.540 | All right, let's see here. What time are we at? Okay, just a couple more minutes. Okay.
00:46:36.020 | Any other questions? I can walk through any of the server stuff, any of the front-end code. Yeah.
00:46:41.060 | Yeah, I love that. So the question was for
00:47:09.940 | a little bit more context about Convex and how it works under the hood: how it distributes its
00:47:18.020 | queries, and what kind of database infrastructure it's running on. So Convex
00:47:27.540 | is open source, but it's a custom database built from the ground up. It's literally a database
00:47:34.500 | built from scratch to be able to solve all of this, to be able to provide this product. And
00:47:41.620 | the people who built it... so my bosses are the CEO and the CTO, Jamie and James. Jamie's here today,
00:47:47.940 | actually; he'll be doing a keynote. I joined Convex because of them. They have
00:47:53.540 | this track record that was mind-blowing to me. It was only about six months
00:48:00.260 | ago that I discovered them, and the more I read, the more I was like, these guys did what?
00:48:06.260 | They built a brand-new database from scratch in Rust. James has his PhD
00:48:16.660 | in database architecture from MIT, and so he was instrumental in designing a novel
00:48:23.620 | database to make all of this work. And the way that it runs:
00:48:30.020 | it runs on an AWS cluster, running the Convex database and application,
00:48:37.140 | which manages all of the WebSocket connections and all of the subscriptions. In terms of
00:48:44.420 | literally and physically how it's distributed, I'd love to follow up;
00:48:49.140 | that's deeper than my expertise. But we do have some large customers using it these days.
00:48:58.180 | There haven't been any fundamental issues at all in terms of its ability to scale. We've been
00:49:05.700 | very happy with how that's worked out so far. And so
00:49:12.900 | you could take a look at the open source repo; it's super interesting.
00:49:18.340 | And there's a really great blog post written by our chief scientist, Sujay, called "How Convex Works."
00:49:25.460 | It does a deep dive into the architecture of the database. And to be fair,
00:49:33.140 | sometimes people ask, well, is that risky, not doing Postgres or something
00:49:39.700 | like that? And it's a bet you'll be taking, but we believe in the trade-off there:
00:49:46.180 | the developer ergonomics, the speed, the fact that if you want to start a company, you don't need
00:49:51.860 | your infrastructure engineers to be building infrastructure. You can take your infrastructure
00:49:58.900 | engineers and they can be building your product.
00:50:04.900 | They can be building the things that your customers care about, not worrying about database backups. Okay, that's it.
00:50:12.820 | Cool. Thank you all so much. I really appreciate your attention and your time. This was
00:50:18.420 | a lot of fun. If you want any help getting the repo up and running,
00:50:23.460 | come find me. I'm happy to help see if we can get it working on somebody else's machine.
00:50:30.340 | Right. Okay. Thank you all very much.