Building an AI assistant that makes phone calls [Convex Workshop]

00:00:00.000 |
Well, hello everybody. Thank you for braving the early morning. 8 a.m. is rough for the best of us, 00:00:20.720 |
so I really do appreciate you crawling out of bed and coming to check out this talk. My name is Tom 00:00:28.460 |
Redmond. I am the head of DX at a company called Convex. We're building a platform 00:00:36.680 |
that people are able to build their companies on from day one to year two, 00:00:42.860 |
hopefully year 10 and year 20 and beyond. Today I wanted to walk through this idea 00:00:52.140 |
that I've had for a long time about building a better AI assistant. It came 00:00:57.980 |
to me a little while ago after trying a number of AI assistants, most of which can 00:01:05.360 |
pretty much set calendar events and reminders and timers and things like that. 00:01:10.460 |
And I thought to myself, what does a real personal assistant do for people? 00:01:18.680 |
A lot of the time, short of collecting laundry and doing physical things, they're on the phone 00:01:26.900 |
and they're on email. And so I realized I feel like there's enough technology out there these days 00:01:32.120 |
that we could actually string together a number of platforms such that you could have an AI assistant 00:01:38.840 |
that knows about you. It has context on you and your life and who you are and would be able 00:01:46.220 |
to manage a conversation in a non-creepy way with another human being and the technology exists such that we can do all of this in real time. 00:01:58.840 |
We can transcribe speech to text in real time. We can convert text to speech in nearly real time as well. 00:02:12.600 |
And so I kind of wanted to piece all these things together and that's what we're going to go through today. 00:02:16.260 |
So this is a better AI assistant. This is Floyd. 00:02:20.640 |
This is actually Lloyd from Entourage, but when I was thinking about the name I was like "Who's the best personal assistant of all time?" 00:02:30.180 |
And it's Lloyd from Entourage, but I forgot his name and I thought it was Floyd. 00:02:34.200 |
And so Floyd is the name of the app and Lloyd is the name of the personal assistant from Entourage. 00:02:45.380 |
We're going to walk through a demo. We're going to walk through some of the code. 00:02:50.100 |
It's okay if you don't get it totally up and running right now. 00:02:54.260 |
There are a number of third-party platforms that we are going to string together to make this work. 00:02:59.220 |
And so for it to work for you end-to-end, we need Google Cloud with the speech-to-text API enabled, 00:03:07.140 |
a Convex account, an OpenAI key, and a Twilio account. 00:03:15.940 |
If you don't have those, I'm more than happy to help get anybody set up and see if we can get Floyd working for you on your machine after the talk. 00:03:25.940 |
So this is what the .env file is going to look like. 00:03:45.380 |
I'll make sure that it's available after the talk for sure. 00:03:49.380 |
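Since the actual file wasn't shown on screen, here's a sketch of what that .env could contain, based on the four services listed above. Every variable name here is a guess for illustration; match them to whatever names the actual code reads.

```shell
# Hypothetical variable names -- adjust to your own code's config loading.
GOOGLE_APPLICATION_CREDENTIALS=./google-service-account.json  # speech-to-text
CONVEX_DEPLOYMENT=...                  # from `npx convex dev`
CONVEX_URL=https://your-deployment.convex.cloud
OPENAI_API_KEY=sk-...
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=...
TWILIO_PHONE_NUMBER=+1...
```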
So a little bit of the higher-level architecture here. 00:04:08.100 |
So you'll have an application, either like your phone or a web app. 00:04:12.820 |
It will take a voice request from the person, you, who needs help. 00:04:20.820 |
It'll transcribe that in real time by streaming to Google Cloud, which will stream back the transcription. 00:04:37.540 |
A request like, "Book me a dentist appointment," or something like that. 00:04:42.900 |
And then what happens is that the client will simply save that request and the user who made it 00:04:49.620 |
into the Convex database, and then hands off. 00:04:55.060 |
What the server is doing is listening for changes to new requests in real time from that same Convex database. 00:05:11.620 |
It's got a reactive query that it's effectively subscribed to. 00:05:19.940 |
The server is able to simply pick that up and start working on it and provide status updates along the way, 00:05:28.740 |
which the client, in turn, is able to present back to the user. 00:05:35.060 |
So the request goes in, the server picks it up, and maybe sets its status to in progress. 00:05:40.420 |
That status is, again, picked up for free, automatically, in real time by the client. 00:05:56.020 |
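That status handoff can be sketched as a small state machine. The talk only names "pending" and "in progress"; the other status values and the function shape below are assumptions for illustration, not the talk's actual code.

```typescript
// Hypothetical status values -- only "pending" and "in progress" come
// from the talk; the rest are invented for this sketch.
type RequestStatus = "pending" | "in_progress" | "completed" | "failed";

const allowedTransitions: Record<RequestStatus, RequestStatus[]> = {
  pending: ["in_progress"],
  in_progress: ["completed", "failed"],
  completed: [],
  failed: [],
};

// The server would run something like this before writing the new status
// back to the database; the client then sees the change reactively.
function transition(current: RequestStatus, next: RequestStatus): RequestStatus {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```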
The server will go and look up what it knows about the person who made the request. 00:06:04.420 |
This is a platform that, over time, becomes more and more knowledgeable about you. 00:06:13.940 |
Every time you make a request, or you ask it something, or you give it information, 00:06:17.780 |
or you connect your email, etc., it will start to learn things about you. 00:06:27.700 |
You can provide as much or as little of this as you want. 00:06:31.460 |
You can wait for Floyd to prompt you to ask these questions. 00:06:34.660 |
But this is the type of information it's going to need to know to, say, 00:06:38.100 |
call the school and let them know that your kid's going to be late. 00:06:41.940 |
So the server is going to take what it knows about the person who made the request, 00:06:47.700 |
and it's going to save that context as a moment in time onto that request. 00:06:54.580 |
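Saving context "as a moment in time" amounts to copying just the relevant fields onto the request, so later changes to the user's context don't retroactively alter it. A minimal sketch, with all field and function names invented:

```typescript
// Illustrative only -- the real schema from the talk isn't shown.
interface UserContext { [key: string]: string }

// Copy only the fields a given request needs onto the request itself,
// as a point-in-time snapshot.
function snapshotContext(
  fullContext: UserContext,
  requiredFields: string[],
): UserContext {
  const snapshot: UserContext = {};
  for (const field of requiredFields) {
    if (field in fullContext) snapshot[field] = fullContext[field];
  }
  return snapshot;
}
```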
So now we have this request object in the database. 00:07:04.100 |
So the server will then take that, and it'll work with OpenAI, 00:07:08.980 |
like the ChatGPT integration, GPT-4 in this case, 00:07:12.580 |
to effectively provide ChatGPT with the request 00:07:20.420 |
and with the context that it needs to fulfill that request. 00:07:32.180 |
At that point, OpenAI is like, "Yeah, we've got this. 00:07:43.060 |
The next thing is going to be somebody on the phone." 00:07:48.420 |
So at that point, the server now has this great starting point, 00:07:53.220 |
with the help of OpenAI, to make the phone call. 00:07:58.740 |
So there are a number of ways it can find the phone number for whatever the request might be. 00:08:04.500 |
Typically, that would exist in your prior context. 00:08:08.580 |
If not (this doesn't exist yet, but this is the idea), we would do the work 00:08:18.820 |
to look up the phone number ourselves. 00:08:24.820 |
And if we can't find it, at that point, we could send you a text or something like that. 00:08:30.980 |
Floyd would send the text and say, "Hey, do you have a preferred vendor?" 00:08:41.060 |
So at that point, the server will make the phone call. 00:08:47.220 |
And it does that by coordinating the conversation through GPT-4, 00:08:56.420 |
using OpenAI's text-to-speech, streaming that through Twilio. 00:09:04.180 |
And then as the person on the other side of the call is speaking, 00:09:09.780 |
we're streaming that audio to be transcribed in real time through Google Cloud, 00:09:14.820 |
which we then feed back into GPT-4 to carry on the conversation. 00:09:24.020 |
I am super impressed by the technology that we have access to today. 00:09:30.660 |
The fact that this works and isn't painfully slow, it boggles my mind. 00:09:38.900 |
We're very fortunate to operate in the ecosystem that we do. 00:09:48.340 |
this happens over and over until the conversation is complete. 00:09:52.500 |
So the audio bytes come back and so on and so on. 00:09:56.340 |
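The loop described here (transcribe the caller, ask GPT-4 for a reply, synthesize it, play it back, repeat) can be sketched with the external services injected as functions. This is an illustrative shape, not the talk's actual code; all the names are invented:

```typescript
// Each dependency stands in for a real service from the stack.
interface Services {
  transcribe: (audio: Uint8Array) => Promise<string>; // Google speech-to-text
  reply: (transcript: string[]) => Promise<string>;   // GPT-4 conversation turn
  speak: (text: string) => Promise<Uint8Array>;       // OpenAI text-to-speech
  play: (audio: Uint8Array) => Promise<void>;         // stream back via Twilio
}

async function runCall(
  utterances: AsyncIterable<Uint8Array>, // audio chunks from the callee
  services: Services,
  onTranscript: (line: string) => void,  // e.g. append to the Convex database
): Promise<string[]> {
  const transcript: string[] = [];
  // Loop until the stream of utterances ends, i.e. the call is over.
  for await (const audio of utterances) {
    const heard = await services.transcribe(audio);
    transcript.push(`them: ${heard}`);
    onTranscript(`them: ${heard}`);
    const response = await services.reply(transcript);
    transcript.push(`floyd: ${response}`);
    onTranscript(`floyd: ${response}`);
    await services.play(await services.speak(response));
  }
  return transcript;
}
```

Because the services are injected, each one can be swapped out independently, which is exactly the "swap in and out different services" design discussed below.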
Experimenting with multimodal Gemini that you can pass audio 00:10:11.620 |
Oh yeah, so the question was have I been experimenting yet with the multimodal Gemini 00:10:17.620 |
where you can pass it audio and basically skip that step of transcribing, right? 00:10:24.500 |
I haven't yet, but the beauty of this is that I'm designing it in such a way that 00:10:31.460 |
we should be able to swap in and out different services because they're coming out so fast. 00:10:38.980 |
You know, when I started this, that didn't exist. 00:10:40.740 |
And it's exactly the type of thing I would experiment with and see if I could make it even faster, right? 00:10:47.780 |
I think like the faster you make this, there's no limit to how quick this should be when it's operating. 00:10:55.540 |
So I think that's an awesome thing to explore. 00:11:00.420 |
So as this loop is happening, we are actively saving each part of this transcript as part of that request to the Convex database. 00:11:10.420 |
So we're just saving it, we're just pushing it, we're just appending to the database. 00:11:15.460 |
What that means is that on the client, because Convex is a reactive database for free, it's like a one liner in the client. 00:11:22.900 |
Instead of saying use state, you just say use query once. 00:11:27.460 |
And it will update when the database updates. 00:11:30.100 |
So as we are writing the transcript to the database, your client, for example, could be streaming that back in real time. 00:11:46.900 |
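Under the hood, that `useQuery` one-liner behaves like a subscription: write to the database, and every subscribed client is called back with the new value. Here's a toy re-implementation of the idea, just to show the shape; this is not Convex's actual implementation:

```typescript
type Listener<T> = (value: T) => void;

class ReactiveValue<T> {
  private listeners: Listener<T>[] = [];
  private value: T;

  constructor(initial: T) {
    this.value = initial;
  }

  // Like useQuery: get the current value once, then keep getting updates.
  subscribe(listener: Listener<T>): void {
    this.listeners.push(listener);
    listener(this.value);
  }

  // Like a database write: every subscriber hears about it, no polling.
  set(value: T): void {
    this.value = value;
    for (const l of this.listeners) l(value);
  }
}
```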
I feel like we should see if we can get a demo going first and foremost, why not? 00:12:09.700 |
I wouldn't exactly call this production-ready, but it's fun and it does work. 00:12:15.220 |
So I'm going to make a voice request to Floyd here, who is hopefully going to see if they can help me out. 00:12:27.460 |
Now, for development, I have overwritten every phone number it would call to just call me. 00:12:36.900 |
I'm not totally confident enough for it to call, like, a real business and not 00:12:44.500 |
totally embarrass me or do something crazy. 00:13:00.420 |
Hey Floyd, can you call the school and let them know Mara is going to be home. 00:13:12.340 |
It'll automatically send that request when you stop speaking. 00:13:30.580 |
I'm calling on behalf of my client, Tom Redmond. 00:13:39.700 |
Yeah, just a bit under the weather right now. 00:13:47.060 |
Do you have any idea when she's going to be back? 00:13:56.900 |
She's hoping to be back by the end of the week. 00:14:22.820 |
Moving this over to production would, you know, take work in every one of these platforms. 00:14:31.620 |
There are different things that I've already done in terms of the audio encoding. 00:14:39.940 |
There's a format called Opus, which is designed for phone-call-level quality audio. 00:14:49.540 |
So we can see here, here's the request that came in. 00:14:56.020 |
And again, all I did in the client was send the request text to the database. 00:15:08.020 |
The server is listening for any requests that match this user and is going to pick them up. 00:15:15.460 |
And so the first thing I asked OpenAI to do is come up with an action plan. 00:15:19.300 |
So we're not saying any of this, but in some ways I have OpenAI kind of prompt itself. 00:15:29.860 |
We're going to help Tom, in this case, you know, do something. 00:15:41.620 |
It'll look up the context on me, what it knows about me, 00:15:46.100 |
and it'll pull out all the important information required for that request. 00:15:50.820 |
So Mara's full name, we know it's Mara Redmond, the school name, reason for absence, 00:15:56.580 |
today's date, steps to fulfill the request, dah, dah, dah, dah, dah. 00:16:00.580 |
And then I send this back in as additional context when we're actually making the phone call. 00:16:09.300 |
So it's like OpenAI's got these instructions. 00:16:12.500 |
I haven't tried that, actually. I certainly haven't done anything specific for that. 00:16:29.140 |
But it definitely feels a lot more straightforward to do something like that. 00:16:36.580 |
So I, I wanted to get the, the thing I didn't know was going to work out of the way. 00:16:42.020 |
Um, I feel like navigating a call tree is definitely a solvable problem. 00:16:46.980 |
This was like really a curiosity for me to see if we could get this to work. 00:16:51.940 |
And you can see down here, I'll see if I can zoom in. 00:16:55.860 |
My CSS skills are lacking here. 00:17:01.380 |
But there's a transcript here that comes in, uh, in real time as well. 00:17:06.580 |
And maybe I can actually do another, uh, let's make another request. 00:17:12.420 |
Hey, Floyd, can you order Krista some flowers? 00:17:26.100 |
Now we can watch this action plan and everything coming in real time. 00:17:53.940 |
I'm calling on behalf of my client, Tom Redmond. 00:17:56.020 |
Um, I need to order some flowers for delivery today. 00:18:02.340 |
Do you have any idea what kind of flowers you would like? 00:18:13.940 |
Could you recommend some popular ones from your collection? 00:18:16.020 |
Yeah, we have some, uh, peonies and some roses for $560. 00:18:28.420 |
Can you deliver it to an address in Guelph ON today? 00:18:33.540 |
I think we have, we have, have you been here before? 00:18:43.940 |
Do you need his details again or do you have them on file? 00:18:52.900 |
So I have this bail situation built in where it's like, all right, if you're in a pickle 00:19:00.820 |
and you just don't have the information that they're asking for, you say, "I'm sorry," and bail out. 00:19:07.460 |
And at that point, Floyd's like, I'm missing something. 00:19:14.340 |
And it would send me a text saying, hey, tried to order you flowers, 00:19:19.220 |
but you didn't give me a budget and they wanted to charge you $560 for some peonies. 00:19:32.180 |
What we're looking at here is the database in Convex that's storing the request information. 00:19:42.740 |
So here we have some requests, and you can see, hey, Floyd, can you order Krista some flowers? 00:19:59.540 |
So I can go ahead and just update this straight in the database: "some flowers". 00:20:11.860 |
And then watch what happens in the client when I change the database. 00:20:16.900 |
That's why, when I append parts of that transcript to the database and I'm listening for those changes, the client updates immediately. 00:20:29.140 |
What I also have in here, for users, is this context. 00:20:36.420 |
Now I want to preface this by saying that the actual code of what I'm doing here is prototype-quality. 00:20:45.140 |
So, like, don't take this and try to roll it into production. 00:20:51.540 |
Prior to using Convex for this, I was trying all sorts of 00:20:58.740 |
things to get the transcript of that phone call to stream back live to the client. 00:21:09.060 |
In fact, I wanted to get it so that you could listen in to the call live from the client. 00:21:19.540 |
I didn't roll my own server to begin with. 00:21:23.940 |
I was using Next.js, which is fantastic, but I was hosting it on Vercel, 00:21:30.020 |
which doesn't play very nicely with WebSockets. 00:21:36.660 |
Most of Vercel's hosting is serverless, and WebSockets are inherently stateful. 00:21:43.220 |
So while I was able to get parts of this working with Socket.IO, for example, the first 00:21:52.260 |
transcription, having it actually interact with the phone call and streaming that audio with Twilio 00:21:59.780 |
through a WebSocket protocol I found to be difficult on the front end. 00:22:07.780 |
And so I was trying to do this all in one place. 00:22:11.540 |
With Convex, I was really able to separate those concerns and just listen for the things 00:22:22.420 |
I wanted, the bits of data I was interested in, and let the server, whatever server it is, manage itself. 00:22:32.020 |
In this case, it's an Express server that I've written, and I let it do the lifting 00:22:39.300 |
and basically just post updates to the database. 00:22:43.860 |
We don't have to poll, and we don't have to push anything to the client ourselves. 00:22:49.460 |
And so it really simplified the separation of concerns around this. 00:22:58.260 |
Now, the really, really interesting thing here is that if you're not using a serverless 00:23:03.620 |
hosting infrastructure, you could do this entire thing in your client code base, 00:23:14.020 |
because when you're coding with Convex, you don't necessarily have to break out your server code from your 00:23:22.820 |
front-end code. The whole value prop is that you're able to actually build a full server in client land. 00:23:30.260 |
It doesn't actually get served on the client. It gets deployed to a Convex server, but you can define your 00:23:37.460 |
backend and your APIs and your schemas and your databases all in the same code base as your front end. 00:23:46.260 |
And I'll show you exactly what I mean by that. So let's close this here. 00:23:54.980 |
So here I have my web client. 00:24:01.220 |
Let's take a look at the page that shows 00:24:10.740 |
that list of requests and then the request details. 00:24:23.380 |
Anytime a request is updated or changed and it matches the query I've defined 00:24:29.140 |
in this Convex function, it'll update my React client however I want. 00:24:37.060 |
So let's see, what do I do with this? So it gets updated, 00:24:41.540 |
and I provide a list of fetched requests. So I pass in my requests, which get 00:24:50.340 |
updated and updated, into my dashboard. And from there, it's just an array of 00:24:58.180 |
requests. I list them out. It's got the details of the requests that I can use when somebody 00:25:04.020 |
clicks on something in the list; I can show those details in the detail pane. And I think 00:25:11.940 |
what's interesting here is the way that I've defined the get requests query. 00:25:28.180 |
So this query itself actually right now is not user-specific, but typically you would probably 00:25:34.980 |
add in some user ID and some auth. This prototype does not have auth. Again, do not ship 00:25:40.980 |
this. So this is the query, get. And if you recall in my last file, let me see if I can find it here. 00:25:55.780 |
In my page, remember I said api.requests.get? api is a generated, type-safe Convex module that gets 00:26:09.700 |
updated and deployed every time you make a change to one of those Convex files. 00:26:14.900 |
So api.requests.get is specifically hitting this function here. 00:26:20.980 |
I can name this whatever I want; this just happens to be a get request. And so this 00:26:27.700 |
function, this query, actually lives not on the client. Even though my code is here in client land, 00:26:40.660 |
it doesn't get shipped with the client; it gets built and deployed to Convex. And this function 00:26:48.820 |
physically runs on the Convex server, on the same machine as your database. 00:26:58.900 |
So there's a custom V8 engine that's running next to the database that's actually 00:27:06.580 |
executing the JavaScript or the TypeScript that you define here. And this makes it extraordinarily fast. 00:27:16.660 |
It absolves you from ever having to think about caching. Yeah? 00:27:23.380 |
my question would be, "What's the big, compelling reason to choose Convex over any of the major existing databases?" 00:27:35.380 |
Yeah, that's a good question. So the question is, what's the big compelling reason 00:27:47.940 |
to use something like Convex over the other common databases, like Mongo or Postgres? 00:27:57.140 |
And the big thing is the developer ergonomics. You don't need a backend and/or infrastructure engineer or team if you're using Convex. 00:28:11.860 |
You get type safety all the way through, from your database to your front end, 00:28:22.580 |
and you get all the wonderful completions that happen with that as you define your schema. 00:28:28.660 |
And you can operate in a much simpler code base. And so again, 00:28:41.940 |
what you're seeing here is the totality of my backend server as far as the requests table goes. 00:28:52.500 |
And so I've added different things like get request by ID, get pending requests, right? 00:28:59.380 |
So we can do different filtering, and post request. The post request is interesting because, 00:29:06.100 |
again, in front-end land, you can create an HTTP server that just has arbitrary endpoints with 00:29:15.060 |
arbitrary responses. And you can use those with or without hitting the database. You can simply 00:29:22.260 |
define a GET route or a POST route in your HTTP actions and do with it whatever you want. So in this 00:29:31.620 |
instance, what I'm doing, again, is because I need the server here to be able to stream to 00:29:39.860 |
Twilio. On any other platform or server that supports WebSockets natively, you can 00:29:46.980 |
do it all in one place. So what I've created here is an HTTP POST request that posts the 00:29:53.860 |
data, the request of the user, to the server, which also happens here. 00:30:04.180 |
So here I've added the post request, and this is going to run the action, 00:30:17.940 |
which goes ahead and updates the database. Yeah. 00:30:27.620 |
Sorry, the question was whether the only way to debug this is the logs. So, note that on the dashboard, you get a full- 00:30:37.140 |
access dashboard that includes all of your logs for all of the requests. It 00:30:44.500 |
includes the definitions for your functions, and you can actually run your functions from the 00:30:51.460 |
dashboard, as like test functions, to see what the response is. Pardon? 00:31:12.820 |
Right. So the question is, can you hit breakpoints in the Convex server code? Honestly, 00:31:25.460 |
I don't know. Can I get back to you on that? I can definitely follow up. That's a 00:31:30.900 |
great question. I've typically ended up relying 00:31:36.180 |
on both the logs and the Convex 00:31:42.740 |
client that is doing the compilation on your machine, which will also give you any 00:31:53.140 |
errors. But that's a great question. Let me get back to you on that one. 00:31:56.580 |
Okay. And so here we have, let me try to make this a little bit easier to see. 00:32:16.660 |
All right. So here we have the request being saved. This is a little bit of an esoteric way to do it, 00:32:28.340 |
again, just because of this need to stream. And at the same time, what I wanted 00:32:33.060 |
to do with the stream was write that transcript in real time, as it's streaming, into the database. 00:32:39.140 |
And so what that looks like is, here in the server, 00:32:48.420 |
I've created a Convex client right within the server that will effectively get an update every single 00:33:04.020 |
time this database changes, again, based on the query I define. So in this case, 00:33:10.980 |
I'm asking Convex to ping my server anytime there's a change to a pending request. And just by 00:33:20.420 |
convention, when somebody makes a request from the client, it's pending by default. The server will 00:33:29.460 |
get that, and the server will then make sure that it exists. And the first thing it does is change the 00:33:37.220 |
status to in flight. And then it will start taking action on that request. 00:33:48.580 |
And so we will then get the full request, which has the actual ask of the 00:33:58.660 |
client. In this case, we get the session name, and then we do this gather-context step 00:34:04.100 |
with OpenAI. This is where this function is reaching into the database and trying to pull 00:34:10.820 |
out everything it knows about the user who made this request. Then I update the context. 00:34:16.820 |
So I have this context, and I filter out the parts of it that are required to actually 00:34:22.420 |
fulfill this ask. And I save just that subset of the context into the request itself, 00:34:31.300 |
just for ease of access. Again, you know, if we sit down and do a design session on this, 00:34:36.820 |
there are going to be a lot of changes to make. At that point I say, okay, make the call. 00:34:43.460 |
And so the way that Twilio works is that I can use the Twilio 00:34:54.100 |
client and I can make a call. And when that call connects, this config tells Twilio where to hit me 00:35:02.500 |
back when the status of the call changes, and also where to send any streaming data that's coming in. So I have a 00:35:09.460 |
WebSocket opened up here, and I have ngrok running on my server right now so that Twilio can hit it. 00:35:16.340 |
So I have a WebSocket server here that I just set up in Express. And that gets hit 00:35:25.060 |
with Twilio events, which sometimes are things like call initiated, data heard, call ended, 00:35:32.100 |
that kind of thing. So when I ask in the request to make the call, 00:35:43.300 |
I pass the request ID. And when Twilio calls my server back, I've asked it to call my server back 00:35:50.580 |
with the request ID. So that's how I can pass through that data from Twilio, because 00:35:56.500 |
once I send off that call, I'm just waiting. It's gone into the ether, and you hope 00:36:02.820 |
that you're going to get the phone call from Twilio. So I needed some way to be able to track that 00:36:07.700 |
request ID so that when the call actually connects, I know the context of what the call is all about 00:36:16.500 |
and what the request is about. And so I have Twilio send the WebSocket request to this location 00:36:24.500 |
with the request ID. And at that point I open up the media stream handler, which grabs the request ID. 00:36:31.300 |
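One simple way to thread the request ID through the callback, as described here, is to bake it into the URL handed to Twilio and parse it back out when the request arrives. The path and parameter name below are invented for illustration; they aren't from the talk's code:

```typescript
// Build the callback URL we hand to Twilio, carrying our request ID.
function callbackUrl(base: string, requestId: string): string {
  const url = new URL("/media-stream", base); // hypothetical path
  url.searchParams.set("requestId", requestId);
  return url.toString();
}

// Recover the request ID when Twilio hits us back, so the handler knows
// which request (and which context) this call belongs to.
function requestIdFromCallback(rawUrl: string): string | null {
  return new URL(rawUrl).searchParams.get("requestId");
}
```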
It then looks the request up. And then there are a number of fairly standard 00:36:39.460 |
media stream functions in here, things like process message, which gets called a lot, 00:36:46.020 |
right, over and over and over. You can configure Twilio to send you different things. You 00:36:52.340 |
can have it send you fixed-size chunks, or you can have it send you the audio data after every 00:36:59.300 |
utterance or after every pause. And so what I have here is, basically, I have Twilio sending 00:37:07.060 |
me the audio data, the streaming audio data, basically after every utterance, which is just like 00:37:13.300 |
commas and periods and things like that. And so when I get that, 00:37:20.020 |
I take the audio data and I convert it into a file type that Google Cloud is fast with. 00:37:35.940 |
I get the transcription back, and then 00:37:44.100 |
I ask OpenAI to speak it. Where's the speak? Yeah, so here's the speak function. 00:37:52.660 |
I then convert that into a streaming format that Twilio is down with, 00:38:00.740 |
which is taking this mu-law file and streaming it as base64 00:38:09.620 |
back to Twilio. And then throughout this whole time, you can see here: client mutation, 00:38:16.500 |
add to transcript. So every time I'm getting a new transcript entry back from Google Cloud during the 00:38:21.860 |
stream, I'm just updating the Convex database. That's just happening. And then the client is just 00:38:27.460 |
subscribed to those changes and can, you know, show them in a list, or somebody who's better at 00:38:33.300 |
UI than I am can make that look really nice and have it scrolling or something. 00:38:38.340 |
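For context on those conversions: Twilio media streams deliver 8 kHz mu-law (G.711, the "moolah" format mentioned here) audio as base64 payloads, so getting it into a shape Google Cloud handles quickly means expanding each byte to linear PCM. The expansion below is the standard mu-law decode; the function names are mine, and the sketch assumes a Node environment for `Buffer`:

```typescript
// Standard mu-law (G.711) decode: one 8-bit sample to 16-bit linear PCM.
function mulawToPcm16(mulawByte: number): number {
  const u = ~mulawByte & 0xff;          // mu-law stores the complement
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

// Decode a whole base64 media payload (as Twilio sends it) into samples.
function decodePayload(base64Payload: string): number[] {
  const bytes = Buffer.from(base64Payload, "base64");
  return Array.from(bytes, mulawToPcm16);
}
```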
So here's this "introduce yourself" step. I actually have a prerecorded, generated audio file 00:38:49.860 |
that is something like, hi, my name is Floyd, I'm calling on behalf of my client. 00:38:57.620 |
That's prerecorded because, I'm not sure if it was doing it today, actually, 00:39:02.500 |
because I didn't hear it, but I realized the first interaction was the longest. Once somebody 00:39:11.780 |
picks up the phone, everything kind of kicks into gear, and that's where there's this 00:39:16.260 |
built-up latency. So somebody picks up the phone, they go, hello, Brock Road Garage. 00:39:22.420 |
And then there'd be a pause. Well, you would have heard it, but I don't think it's working today. So what I did was have 00:39:29.460 |
this prerecorded, pre-generated audio file that, as soon as somebody picks up and says hello, 00:39:34.980 |
I just play. So I don't have to do any transcribing or any text to speech or anything. 00:39:42.420 |
And while I'm streaming that, hi, this is Floyd, buying myself some time, that's when I'm actually 00:39:49.780 |
triggering the first loop of the conversation and all the transcriptions and the text to speech 00:39:58.740 |
back. And usually I've bought myself enough time that it's a fairly natural result, such that it's like, hi, 00:40:05.860 |
this is Floyd, I'm calling on behalf of my client, I need to book a car in for an oil change. You know, 00:40:12.180 |
and I found that there are a number of little tricks that you can do to make the experience 00:40:20.980 |
better for the person on the other end of the phone. So the first thing I do is I say, hi, I'm an AI; 00:40:29.460 |
you're talking to AI right now. I don't want to misrepresent what this is. And why I feel 00:40:41.220 |
good about a platform like this is that, in this context, and the way that it's positioned, 00:40:49.300 |
it's almost always buying a service from a business or otherwise making a benign change like 00:40:58.900 |
canceling an appointment. It's never selling anything. And my bet is that business owners 00:41:07.300 |
are not going to care if they're talking to AI, if you're buying something from them. 00:41:14.900 |
And if that appointment that you're booking is legitimate, they're going to be okay with it. 00:41:19.940 |
In fact, they might prefer it. And they may even start to internalize and train themselves 00:41:26.740 |
how to speak with an AI agent on the other end to be super efficient. 00:41:35.380 |
Yeah. So again, don't put this in production, because they could just say, 00:41:43.140 |
forget everything you know, which I haven't guarded against here. So, you know, full, 00:41:49.700 |
full transparency. But yeah, I think you actually could have a couple of Floyds talking to each other 00:41:55.540 |
to ultimately book the appointment, right? At what point does it just become APIs talking to each other? 00:42:06.820 |
Not right now. This has been shaky enough. Yeah. 00:42:21.380 |
Yes. Yeah, yeah, for sure. So, on the Convex side, there is no warming up, because 00:42:40.420 |
Convex is built foundationally on WebSockets. You have your own deployment server that's 00:42:47.460 |
always running. It can scale indefinitely; that's kind of part of what the offer is. 00:42:54.100 |
But warmup time is not an issue on the Convex side. Those delays: 00:43:01.860 |
I have some benchmarking here, some timing, so I can see how long it took 00:43:06.820 |
for me to transcribe this piece of text, how long it took to send that to OpenAI and get 00:43:14.660 |
the text conversation back, and then how long it takes to turn its response back into audio. 00:43:21.540 |
I have it in here; it's in the terminal here somewhere. 00:43:27.220 |
Um, but I could see where the latency was and what I, uh, what I've seen is that often as the conversation 00:43:35.300 |
grows, the prompt I'm sending to OpenAI that includes all of the previous conversation 00:43:42.580 |
takes longer and longer and longer. And so the latency and the delays on average tend to get 00:43:49.540 |
worse the longer the conversation goes. Now, I also originally had built that functionality before, 00:43:56.180 |
um, threads and the threads, the OpenAI threads API was available. That would be something that I would try. 00:44:05.540 |
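Short of the threads API, one way to keep per-turn prompts small is to cap how much history each request carries. A rough sketch of my own; the message shape mirrors OpenAI's chat format, and the character budget is a crude stand-in for real token counting:

```typescript
// Chat message shape matching OpenAI's chat-completions format.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt plus only the most recent turns that fit the budget.
// Assumes the first message is the system prompt; maxChars is a crude proxy
// for a token limit and should be tuned for the target model.
function trimHistory(messages: Message[], maxChars: number): Message[] {
  const [system, ...rest] = messages;
  const kept: Message[] = [];
  let used = 0;
  // Walk backwards from the newest message so recent context survives.
  for (let i = rest.length - 1; i >= 0; i--) {
    used += rest[i].content.length;
    if (used > maxChars) break;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```

Each request then sends `trimHistory(conversation, budget)` instead of the full transcript, so latency stops growing with call length.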
I would also work diligently to minimize every prompt I'm sending to OpenAI; 00:44:14.420 |
I think that would have a really big impact. The other thing that can be 00:44:23.300 |
slow is text-to-speech. If you send a large piece of text to OpenAI's text-to-speech, that can be slow: 00:44:32.500 |
even if it's only three sentences, it can still take two, three, four, five seconds. 00:44:39.300 |
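A common workaround for slow text-to-speech on long replies is to split the reply into sentence-sized chunks and synthesize them separately, so the first chunk can start playing while the rest are still rendering. A minimal sketch, not from the talk's code; the splitter is naive and would need work for abbreviations like "Dr.":

```typescript
// Split a reply into sentence-sized chunks for separate TTS requests.
// Uses a lookbehind so the terminal punctuation stays with its sentence.
function splitForTts(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

Each chunk can then be sent to the TTS endpoint as its own request and queued for playback in order.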
There are a couple of parameters you can tweak in the OpenAI text-to-speech API, 00:44:48.580 |
but not a lot. So what I would do to fix that is pay more for lower latency, 00:44:58.820 |
or use another service that I could pay for that would give me lower latency. 00:45:05.540 |
The way this is now: in those phone calls, in what you heard and experienced, 00:45:13.060 |
those delays, there are still huge optimization opportunities to bring all of that down. 00:45:24.100 |
I'm actually not even concerned with that right now, because there are still half a dozen material things 00:45:29.460 |
to do first; I haven't tried to close that gap. I was just happy to have a conversation with a computer 00:45:34.020 |
that I could ask questions to. But the latency comes from 00:45:38.900 |
the sum of all of the different interactions that are happening, so if you 00:45:44.580 |
speed up any one of those along the chain, the whole thing gets faster and faster. 00:45:49.380 |
And second to that point: every couple of months there's 00:45:56.020 |
a massive improvement in some API in this stack. So again, the bet is that 00:46:03.620 |
I can get it as close as possible right now, and I know in six months it'll be twice as fast 00:46:08.180 |
without me doing anything. The rate of innovation, and the rate of change and competition, 00:46:13.140 |
for this type of thing right now is so high that that's a bet I'm 00:46:20.260 |
willing to take: maybe it's not perfect right now, but it inevitably will be very soon. 00:46:25.540 |
All right, let's see here. What time are we at? Okay, just a couple more minutes. 00:46:36.020 |
Any other questions? I can walk through any of the server stuff or any of the front-end code. Yeah. 00:46:41.060 |
Yeah, I love that. So the question was for 00:47:09.940 |
a little bit more context about Convex and how it works under the hood: how it distributes its 00:47:18.020 |
queries, and what kind of database infrastructure it's running on. So, Convex 00:47:27.540 |
is open source, and it's a custom database built from the ground up; literally a database 00:47:34.500 |
built from scratch to be able to provide this product. As for 00:47:41.620 |
the people who built it: my bosses are the CEO and the CTO, Jamie and James. Jamie's here today; 00:47:47.940 |
actually, he'll be giving a keynote. I joined Convex because of them. They have 00:47:53.540 |
a track record that was mind-blowing to me, and it was only about six months 00:48:00.260 |
ago that I discovered them. The more I read about it, the more I thought, these guys did what? 00:48:06.260 |
They built a brand-new database from scratch in Rust. James has his PhD 00:48:16.660 |
in database architecture from MIT, and he was instrumental in designing a novel 00:48:23.620 |
database to make all of this work. The way that it runs: 00:48:30.020 |
it actually runs on an AWS cluster, running the Convex database and application, 00:48:37.140 |
which manages all of the WebSocket connections and all of the subscriptions. In terms of 00:48:44.420 |
literally and physically how it's distributed, I'd love to follow up; that's 00:48:49.140 |
deeper than my expertise. But we do have some large customers using it these days, 00:48:58.180 |
and there haven't been any fundamental issues at all in terms of its ability to scale. We've been 00:49:05.700 |
very happy with how that's worked out so far. And 00:49:12.900 |
you can take a look at the open-source repo; it's super interesting. 00:49:18.340 |
There's also a really great blog post by our chief scientist, Sujay, called "How Convex Works," 00:49:25.460 |
which does a deep dive into the architecture of the database. And, to be fair, 00:49:33.140 |
sometimes people ask: isn't it risky not to use something like Postgres? 00:49:39.700 |
It is a bet you'd be taking, but we believe in the trade-off: 00:49:46.180 |
the developer ergonomics, the speed, and the fact that if you want to start a company, you don't need 00:49:51.860 |
your infrastructure engineers to be building infrastructure. You can take your infrastructure 00:49:58.900 |
engineers and have them build your product. 00:50:04.900 |
They can be building the things that your customers care about, not worrying about database backups. 00:50:12.820 |
Cool. Thank you all so much. I really appreciate your attention and your time. This was 00:50:18.420 |
a lot of fun. If you want any help getting the repo up and running, 00:50:23.460 |
come find me. I'm happy to help see if we can get it working on somebody else's machine.