
AI Voice Assistants with OpenAI's Agents SDK | Full Tutorial + Code


Chapters

0:00 AI Voice Assistants
0:58 Getting the Code
2:19 Handling Audio in Python
6:56 Agents SDK Voice Pipeline
11:02 Speaking to the Agent
13:38 Chat with Voice Agent
15:31 Voice Agents Conclusion

Whisper Transcript

00:00:00.000 | Today we are going to introduce how to use voice within Agents SDK. Now Agents SDK, if you don't
00:00:07.340 | already know, is OpenAI's new AI framework, and when I talk about voice I am speaking about
00:00:15.220 | building a voice interface with which you communicate with your agents. Now voice as a
00:00:22.880 | topic is very broad and there are various ways of implementing it. One of those is through agents
00:00:29.980 | SDK. Agents SDK provides various voice features that just help us build a voice agent quite easily
00:00:38.920 | and in my opinion would probably act more as an introduction to voice agents before you potentially
00:00:45.960 | step into something more advanced such as LiveKit or Daily. That being said, you can do a lot already
00:00:53.560 | with just voice in agents SDK so I definitely recommend starting with this. Now to get started
00:00:59.160 | we're going to be using one of the chapters from an upcoming course that we're putting together
00:01:04.100 | which is focused on Agents SDK. One of these chapters is on voice and this is sort of an
00:01:11.820 | earlier mostly finished draft. So the way that we would use this is either you can install everything
00:01:20.420 | through a pip install like so or the way that I would recommend doing this is actually to go ahead and
00:01:27.240 | git clone. So I'll git clone that repo navigate into that repo and what you'll first want to do is make
00:01:36.020 | sure you have uv installed. Then you want to run uv venv and set a Python version to use. It doesn't have to be 3.12.7 but that's what I would recommend, just to
00:01:49.360 | essentially align your environment with the one I am using here. Then I would activate that environment
00:01:56.980 | and you see now that I'm on Python 3.12.7 and agents SDK course, which is the environment name,
00:02:03.740 | and I would do uv sync to just ensure that you have all of the prerequisites or the latest prerequisites
00:02:10.020 | installed. Now once you've done that you can navigate to your code editor whatever you are using and open the repo
00:02:17.360 | and go to voice. Now I'm just going to take you through everything we need to know to get started with voice
00:02:23.700 | in Python. So the very first thing is actually not specific to agents SDK. We first need to understand
00:02:30.760 | okay how do we handle voice or how do we handle audio in Python? How do we record audio? How do we play it back?
00:02:38.980 | So first we're going to jump into that. Now in our notebook we're going to make sure we have the
00:02:44.620 | correct kernel installed. It should already be preloaded here, although you may have to go
00:02:49.820 | into Select Kernel and then find the venv's bin Python. And then what we're going to do is take a look at the
00:02:57.360 | sounddevice library. So this is the library that we are using to handle audio in Python.
00:03:04.660 | So what we will need to do before recording or playing audio is get the sample rate of our input
00:03:13.780 | and output devices. And we do that with sound device query devices and then we want to look for input
00:03:21.140 | devices and output devices. Now if we run this we should see something like this especially if you're on
00:03:26.620 | a MacBook, so I can see that I have my microphone and I can see that I have my speakers. Okay, and then the sample rate
00:03:33.880 | we're going to see is down here. So we're going to pull that in. So yep, that's all I'm doing here: pulling in the
00:03:40.640 | in sample rate and out sample rate. And then what we're going to do here is actually create this input stream. This
00:03:47.740 | is the object we're streaming our audio input into. And that is going to continue until we press
00:03:55.300 | enter because we've added this input function here. So let's go ahead and try that. Okay so you can see
00:04:01.160 | at the top here it says press enter to confirm your input or escape to cancel. So actually right now as I'm
00:04:07.780 | talking, this is recording. So I'm going to press enter, and I can come down here and see
00:04:14.780 | that we have these recorded chunks. So what this is doing is it's recording these chunks of audio each of
00:04:23.840 | those is a numpy array and each one of those chunks is 512 values that represent a very small amount of time
00:04:33.800 | in our audio. So right now we have 1,401 of these chunks, and then inside each
00:04:44.320 | one of those is a 512-element vector. Okay, also the 1 that you see here is the number of audio channels
00:04:52.600 | that we have. My microphone is recording in mono not stereo. So because it's mono there is just one audio
00:05:00.060 | channel. If it was stereo there would be two audio channels so this would be two dimensions.
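To make that recording flow concrete, here is a minimal sketch of the kind of code involved, assuming sounddevice's callback-based InputStream; the variable names are mine and the notebook's exact code may differ slightly.

```python
import numpy as np
import sounddevice as sd

# Native sample rates of the default input and output devices
# (these values will differ from machine to machine).
in_samplerate = int(sd.query_devices(kind="input")["default_samplerate"])
out_samplerate = int(sd.query_devices(kind="output")["default_samplerate"])

recorded_chunks = []

# Each callback delivers one small block of samples; we simply collect
# blocks until the user presses Enter.
def on_audio(indata, frames, time, status):
    recorded_chunks.append(indata.copy())

with sd.InputStream(
    samplerate=in_samplerate,
    channels=1,        # mono microphone, so a single audio channel
    dtype="int16",
    blocksize=512,     # 512-sample chunks, as described above
    callback=on_audio,
):
    input("Recording... press Enter to stop")
```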
00:05:04.620 | So what we need to do is actually concatenate all of these and it will create a single like audio array
00:05:13.340 | here and we can actually play that back. So let me just open the audio
00:05:22.300 | and we can just play this back and see what is in there.
00:05:26.380 | Okay so you see at the top here it says press enter to confirm your input or escape to cancel.
00:05:33.580 | Okay so actually right now as I'm talking this is recording. So I'm going to press enter.
00:05:42.860 | Okay so that was the recording from before. Now, you can see here that this ran straight away.
00:05:53.580 | So this cell didn't wait for the audio to play. So what we can actually do here is write
00:05:59.180 | sd.wait(). And with this, the cell will only complete once the audio finishes playing.
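Continuing the sketch above (reusing recorded_chunks and out_samplerate from it), playback is just a concatenate followed by sd.play, with sd.wait() keeping the cell busy until the audio has finished:

```python
# Stitch the recorded blocks into one array and play it back.
audio = np.concatenate(recorded_chunks)

sd.play(audio, samplerate=out_samplerate)
sd.wait()  # block until playback finishes, so the cell doesn't return early
```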
00:06:06.940 | Now let's just come down and see what this audio looks like. So it's actually just going to be a waveform
00:06:14.860 | and we can see here. Okay so a pretty typical audio waveform there. So we have that.
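If you want to reproduce that quick look at the waveform, a simple plot of the concatenated array is enough (matplotlib is an assumption here, not necessarily what the notebook uses):

```python
import matplotlib.pyplot as plt

# Amplitude over sample index for the recorded audio.
plt.plot(audio)
plt.xlabel("Sample")
plt.ylabel("Amplitude")
plt.show()
```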
00:06:21.100 | Now this is our audio file and what we need to do now is just transform it into this AudioInput type
00:06:30.940 | which is specific to Agents SDK. So this is the SDK's own audio object where we'll be storing that
00:06:39.500 | audio file. Okay, you can see it's just an array. I don't think there's anything
00:06:44.300 | beyond it, beyond what we already did with, you know, having a numpy array.
00:06:51.580 | I think it's just a type that is now used to handle that array. So we have that.
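As a sketch, assuming the import path from the openai-agents voice module, wrapping the array looks roughly like this:

```python
from agents.voice import AudioInput

# The SDK's own audio container; essentially just a wrapper around the raw buffer.
audio_input = AudioInput(buffer=audio)
```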
00:06:58.300 | Now we're moving on to all the voice pipeline stuff, so all the Agents SDK stuff. Now obviously we will need an OpenAI API key
00:07:06.700 | at this point. So I'm going to run this and you'll need to get your OpenAI API key from
00:07:12.860 | platform.openai.com. And once you have your API key you're just going to paste it into the top here
00:07:20.300 | or wherever the little dialog box pops up for you, and press enter. Okay, now we have our API key in there
00:07:28.140 | and now what we're going to do is initialize an agent through Agents SDK. So there's nothing
00:07:35.340 | nothing unique or new here. Okay all we are doing is initializing normal agent. The only thing that is
00:07:45.100 | is actually slightly different is I'm specifying to the assistant in the instructions so that developer
00:07:52.780 | or system prompt that the user is speaking to you via a voice interface. The reason I put that in there
00:08:01.260 | is because if I do not, and I say to the agent, as I do all the time, "can you hear me",
00:08:08.700 | the agent will typically respond with no I cannot hear you but I can read what you are typing.
00:08:18.780 | Because the LLM as far as it's concerned is reading text it doesn't actually realize that that text is
00:08:25.100 | coming from a voice interface, right? It doesn't realize there's a speech-to-text step
00:08:31.420 | happening in the middle there and it also doesn't realize that what it writes down is going to become
00:08:37.100 | a voice through a text-to-speech component afterwards. So it's important to mention that if you are building
00:08:44.620 | voice agents. So we run that, and then we're going to pass our agent into this single agent voice workflow.
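Putting those two steps together, a rough sketch might look like the following; the API key prompt, the exact instructions wording, and the import paths are my assumptions rather than the notebook's literal code:

```python
import getpass
import os

from agents import Agent
from agents.voice import SingleAgentVoiceWorkflow

# The SDK reads the key from the environment; prompt for it if it isn't set.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

# A normal Agents SDK agent. The instructions tell it that the user is on a
# voice interface, so it doesn't reply with "I can't hear you".
agent = Agent(
    name="Assistant",
    model="gpt-4.1-nano",
    instructions=(
        "You are a helpful assistant. The user is speaking to you through a "
        "voice interface: their speech is transcribed to text for you, and "
        "your reply will be converted back to speech."
    ),
)

# Wrap the agent so the voice pipeline knows how to run it on each turn.
workflow = SingleAgentVoiceWorkflow(agent)
```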
00:08:54.540 | Great so we have that and now we will initialize our pipeline configuration. So the pipeline configuration
00:09:01.740 | is essentially saying okay well what are the settings that you want to use what's the configuration you
00:09:07.580 | want to use for your pipeline of voice. So the voice pipeline at a core like very basic level is going
00:09:16.700 | to be three main components so well there's obviously all the data transformation stuff that happens first
00:09:22.300 | then you're going to get to a speech to text component so that's going to convert your spoken audio into
00:09:30.220 | text. That text then gets passed to an LLM. In this case we specified that it is GPT 4.1 nano. That LLM is
00:09:38.460 | then going to generate text, which goes to a text-to-speech model, and then that spoken audio from the LLM
00:09:46.220 | is passed back to us. So we can set various parameters in here. One of those is this text-to-speech
00:09:53.740 | model settings, so in here you can essentially tell the text-to-speech model what this spoken
00:10:01.420 | audio should be like, what your system prompt is essentially as a text-to-speech model, and this is
00:10:09.980 | actually from one of the examples that OpenAI provided, either in one of their blogs or their docs, I don't
00:10:18.460 | quite remember, but this is a nice example of how you can use that. Okay, so we give it a personality,
00:10:24.060 | a tone, we specify that it should be clear, articulate and steady, we have a tempo and emotion. Okay, so we
00:10:31.500 | can use all of that and we just pass that into this voice pipeline config object. And once we have that
00:10:38.220 | voice pipeline config, or configuration, and we have our agent workflow, we use both of those together to
00:10:46.060 | initialize our voice pipeline object, and this is what we're actually going to be using. This is
00:10:51.740 | the pipeline: this is what takes our voice, converts it into text, passes it to the LLM, and so on.
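A sketch of that configuration step, assuming the TTSModelSettings / VoicePipelineConfig / VoicePipeline names from the openai-agents voice module and paraphrasing the style prompt rather than quoting OpenAI's example:

```python
from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig

# Style instructions for the text-to-speech model: personality, tone, pacing, emotion.
tts_settings = TTSModelSettings(
    instructions=(
        "Personality: upbeat and friendly. "
        "Tone: clear, articulate and steady. "
        "Tempo: natural, with brief pauses. "
        "Emotion: warm and engaged."
    )
)

config = VoicePipelineConfig(tts_settings=tts_settings)

# Speech-to-text -> agent (LLM) -> text-to-speech, wired together.
pipeline = VoicePipeline(workflow=workflow, config=config)
```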
00:10:59.740 | This is what we're using here. So let's try and just pass the audio that we recorded before to this pipeline and just
00:11:09.740 | see what happens. Okay, there are various ways that we can pass audio and I'm going to show you some more
00:11:14.060 | in a moment. Okay, so pipeline run, this is an async method, so we do need to await it.
00:11:21.740 | So we pass our audio input into pipeline run, we await it, and then what we need to do is iterate through all
00:11:30.300 | of the streamed chunks that we are receiving, or streamed events. So we simply do an async for loop here and we are
00:11:40.540 | looking at the stream that is coming out of this result. Okay, and then, because there are
00:11:47.420 | various event types that can be sent, the only one that will contain audio that we want to be returning
00:11:53.900 | to the user is this voice stream event audio. Okay, that's the only one that's important to us
00:12:00.220 | in this scenario. So when we see that, we append the event data to our response chunks, which is an
00:12:08.060 | empty list here, and that is, you know, similar to before where we had those chunks of audio
00:12:13.180 | that we then concatenated. This is the exact same thing: these are chunks of audio and we need to concatenate
00:12:20.380 | to create a single audio file. Okay, so then we can go ahead and play that. Now the one thing that we do
00:12:28.300 | need to be careful with here is the sample rate: what OpenAI generates or outputs is not the 48 kilohertz
00:12:38.940 | that we saw from both my input and output devices. They have their own sample rate, and that is 24 kilohertz,
00:12:47.420 | so we just specify that here. Okay, OpenAI has its own sample rate, so when we're receiving audio from
00:12:53.900 | them, that is what we need to pass in here in order to play that audio at the correct speed. If we don't,
00:13:01.820 | I can show you, it's pretty weird, it just goes too fast. So let me turn up the audio again and we
00:13:08.700 | can play this: "You mentioned that you see something at the top here. Could you tell me what it says or describe it?
00:13:13.020 | I'll do my best to assist you."
00:13:15.820 | Okay, so yeah, it's a little fast, so we just turn this down to 24 kilohertz, and this is the correct sample rate:
00:13:23.340 | "Got it. At the top it says press enter to confirm. How can I help you with this?"
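The run-and-play flow just described might look roughly like this as one helper; the event and result names are assumptions based on the openai-agents voice module, and the 24 kHz constant is the output sample rate mentioned above:

```python
from agents.voice import VoiceStreamEventAudio

OPENAI_SAMPLE_RATE = 24_000  # OpenAI's TTS output rate, not the device's 48 kHz

async def speak(audio_input):
    # One pass through the pipeline (an async API, so it must be awaited).
    result = await pipeline.run(audio_input)

    # Keep only the audio events from the stream, ignoring lifecycle events.
    response_chunks = []
    async for event in result.stream():
        if isinstance(event, VoiceStreamEventAudio):
            response_chunks.append(event.data)

    response_audio = np.concatenate(response_chunks)

    # Play back at 24 kHz; using the device's 48 kHz here would sound sped up.
    sd.play(response_audio, samplerate=OPENAI_SAMPLE_RATE)
    sd.wait()
```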
00:13:30.300 | Okay, and that response is just replying to what I was saying earlier, so it doesn't really
00:13:36.860 | make a ton of sense, but let's go ahead and just see how we can actually talk to this in a more
00:13:41.500 | responsive way. So I've essentially wrapped all the logic that we had before, that we've just been through,
00:13:48.460 | into this async method here, and it all happens within this while loop. Essentially, what this while
00:13:55.180 | loop is doing is saying: okay, we have that input function from before, and when you press enter for
00:14:01.500 | that input, it is going to start recording. Then what you have to do is press enter again to stop
00:14:07.580 | recording, and the pipeline is going to respond to us. Once we're done and we want to finish the conversation,
00:14:15.100 | we press q and enter, and I'll show you how we do this, and it will exit. So let's go ahead and just try this.
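As a sketch of that loop (reusing speak(), in_samplerate and AudioInput from the snippets above), it could look something like this; again, the notebook's actual code may differ:

```python
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput

async def conversation():
    while True:
        if input("Press Enter to speak, or 'q' then Enter to quit: ").strip().lower() == "q":
            break

        # Record one turn: collect microphone blocks until Enter is pressed again.
        chunks = []
        def on_audio(indata, frames, time, status):
            chunks.append(indata.copy())

        with sd.InputStream(samplerate=in_samplerate, channels=1,
                            dtype="int16", callback=on_audio):
            input("Recording... press Enter to stop")

        # Send the recorded turn through the voice pipeline and play the spoken reply.
        await speak(AudioInput(buffer=np.concatenate(chunks)))

# In a notebook, run it with:  await conversation()
```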
00:14:21.580 | So now you see I need to press enter to speak, so I just press enter once. "Hello, can you hear me?"
00:14:28.140 | Press enter again. "Yes, I can hear you in the sense that I can read your messages.
00:14:36.700 | How can I assist you today?" Yeah, so you can actually see the weird response it will give you if
00:14:44.060 | you're not prompting it well enough. Apparently I'm not prompting it well enough; I thought I was, but
00:14:49.100 | anyway. So I'm going to speak again: "Hi, I'm actually talking to you through a voice interface. I believe
00:14:58.940 | that the voice is being transformed into text, which is what you're reading, so I think you can hear me."
00:15:10.700 | "I understand you're speaking to me face to face and that your voice communication might feel more
00:15:16.300 | formal. While I can't hear voice directly, I process your typed messages. If you'd like, you can share more
00:15:23.420 | details or ask questions and I'll do my best to help." Okay, and I'm just going to press q to exit. So it
00:15:32.140 | seems a bit overly enthusiastic to me, and that prompting, especially the "okay, you're actually in a
00:15:39.180 | voice interface" part, needs a bit of work, which is fine, we can of course do that. But the main part of this,
00:15:45.420 | which is that we are speaking and the agent is reading what we are saying and responding back to us, is
00:15:53.100 | working, and it's actually working. And this is while the model that we're using here is GPT-4.1 nano, which
00:15:59.100 | is, as far as I'm aware, a very small model and not particularly performant, which might also explain
00:16:04.220 | the lack of understanding of my system prompt. But you can see, very quickly we built this voice
00:16:12.620 | interface that we can speak to an LLM through, which I think is really cool. And definitely, going forwards,
00:16:20.860 | I think the majority of the time that normal people, and maybe even developers, and I think just a very
00:16:29.980 | broad range of people, interact with AI, it's either going to be AI that you don't even notice is there,
00:16:37.340 | or it's going to be AI that you are actually speaking to. I don't think people are necessarily
00:16:41.580 | going to be typing all the time to AI, because voice is just a much more natural way of speaking, and
00:16:49.020 | it allows for a more fluid conversation, which I think is the direction that we're going. Not for
00:16:54.940 | everything, of course; like code assistance, if I want code help, maybe sometimes I'll talk to
00:17:00.380 | an agent, but the majority of the time I'm probably going to be typing, and I don't expect that to change
00:17:07.180 | necessarily. But there are just so many modern, maybe normal, use cases where, okay, you want to just know
00:17:16.700 | something quickly and you want the agent to web search for you, and maybe just chat to it very quickly
00:17:21.260 | and talk through something, or you want to practice your Italian, for example, which I do all the time.
00:17:27.260 | That is really good to use a voice interface for, and there are just so many other cases like that. So
00:17:34.220 | yeah, I think this is very exciting, and this is really a very quick way to get started with voice
00:17:41.100 | agents, which I think is great. In the future I'm definitely going to cover more on voice, and as I
00:17:47.900 | mentioned, this specific notebook is part of a broader course on the Agents SDK, so I'm obviously
00:17:55.260 | going to be talking more about the Agents SDK very soon as well. But for now I'll leave it there. I hope all this has
00:18:01.980 | been useful and interesting, so thank you very much for watching, and I'll see you again in the next one. Bye!
00:18:22.440 | We'll see you next time.