AI Voice Assistants with OpenAI's Agents SDK | Full Tutorial + Code

Chapters
0:00 AI Voice Assistants
0:58 Getting the Code
2:19 Handling Audio in Python
6:56 Agents SDK Voice Pipeline
11:02 Speaking to the Agent
13:38 Chat with Voice Agent
15:31 Voice Agents Conclusion
Today we are going to introduce how to use voice within Agents SDK. Agents SDK, if you don't already know, is OpenAI's new AI agent framework, and when I talk about voice I am talking about building a voice interface through which you communicate with your agents. Now, voice as a topic is very broad and there are various ways of implementing it. One of those is through Agents SDK. Agents SDK provides various voice features that help us build a voice agent quite easily, and in my opinion it acts more as an introduction to voice agents before you potentially step into something more advanced such as LiveKit or Daily. That being said, you can already do a lot with just voice in Agents SDK, so I definitely recommend starting with this.
Now, to get started, we're going to be using one of the chapters from an upcoming course that we're putting together, which is focused on Agents SDK. One of these chapters is on voice, and this is an early but mostly finished draft. You can either install everything through a pip install, or, the way that I would recommend, go ahead and git clone the repo and navigate into it. First, make sure you have uv installed. Then run uv venv and set a Python version to use. It doesn't have to be 3.12.7, but that's what I would recommend, just to align your environment with the one I am using here. Then activate that environment, and you'll see that I'm now on Python 3.12.7 in the agents-sdk-course environment, which is the environment name. Then run uv sync to ensure that you have all of the prerequisites, at their latest versions, installed. Once you've done that, navigate to your code editor, whatever you are using, open the repo,
and go to the voice chapter. Now, I'm going to take you through everything we need to know to get started with voice in Python. The very first thing is actually not specific to Agents SDK: we first need to understand how to handle audio in Python. How do we record audio? How do we play it back? So we'll jump into that first. In the notebook, make sure you have the correct kernel selected; it should already preload, although you may have to go into "select a kernel" and find the .venv/bin Python. Then we're going to take a look at the sounddevice library. This is the library we are using to handle audio in Python. Before recording or playing audio, we need to get the sample rate of our input and output devices. We do that with sounddevice's query devices function, looking for the input device and the output device. If we run this we should see something like my setup, especially if you're on a MacBook: I can see my microphone, I can see my speakers, and the sample rate is listed for each, so we pull those in as the input sample rate and output sample rate. Then we create this input stream, which is what we stream our audio input into, and recording continues until we press enter, because we've added the built-in input function here. So let's go ahead and try that.
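To make that concrete, here's a minimal sketch of the recording logic, assuming the sounddevice and numpy packages the notebook uses (the blocksize and dtype are illustrative, not necessarily the course's exact values):

```python
import numpy as np
import sounddevice as sd

# Ask the default input/output devices for their native sample rates
in_sample_rate = int(sd.query_devices(kind="input")["default_samplerate"])
out_sample_rate = int(sd.query_devices(kind="output")["default_samplerate"])

recorded_chunks = []

# sounddevice calls this for every block of frames while the stream is
# open; we copy each block into a plain Python list
def on_audio(indata, frames, time, status):
    recorded_chunks.append(indata.copy())

# Keep the stream (and so the recording) open until Enter is pressed
with sd.InputStream(samplerate=in_sample_rate, channels=1, dtype="int16",
                    blocksize=512, callback=on_audio):
    input("Press Enter to confirm your input, or Escape to cancel... ")
```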
Okay, so you can see at the top here it says "press enter to confirm your input or escape to cancel". So right now, as I'm talking, this is recording. I'm going to press enter, and I can come down here and see that we have these recorded chunks. What this is doing is recording chunks of audio; each of those is a NumPy array, and each chunk is 512 values that represent a very small slice of time in our audio. Right now we have 1401 of these chunks, and inside each one is a 512-element vector. The other dimension you see here is the number of audio channels. My microphone records in mono, not stereo, so there is just one audio channel; if it were stereo there would be two audio channels, and this dimension would be two.
What we need to do now is concatenate all of these, which creates a single audio array, and then we can play that back. So let me turn up the audio and we can play this back and hear what's in there. [Playback: "Okay so you see at the top here it says press enter to confirm your input or escape to cancel. Okay so actually right now as I'm talking this is recording. So I'm going to press enter."] Okay, so that was the recording from before. Now, you may notice that this cell ran straight away; it didn't wait for the audio to finish playing. What we can do is call sd.wait, which makes the cell complete only once the audio completes. Now let's come down and see what this audio looks like. It's just a waveform, and we can see it here: a pretty typical audio waveform. So we have that.
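As a sketch, the concatenation and playback step with sounddevice might look like this, continuing from the recording snippet above:

```python
# Stitch the per-block (512, 1) arrays into one flat mono buffer
audio_array = np.concatenate(recorded_chunks).squeeze()

# Play it back at the rate it was recorded at; sd.wait() blocks the
# cell until playback finishes instead of returning immediately
sd.play(audio_array, samplerate=in_sample_rate)
sd.wait()
```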
Now, this is our audio, and what we need to do next is transform it into the AudioInput type, which is specific to Agents SDK. This is the SDK's own audio object where we'll be storing that audio. You can see it's just an array; I don't think there's anything beyond what we already had with the NumPy array. It's just a type the SDK uses to handle that array. So we have that.
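A minimal sketch of that wrapping step, assuming the openai-agents package with its voice extras installed (pip install "openai-agents[voice]"):

```python
from agents.voice import AudioInput

# The SDK's audio container -- essentially a typed wrapper around the
# NumPy array we already have. Declaring our recording's frame rate is
# my assumption here; the SDK's default is OpenAI's own 24 kHz
audio_input = AudioInput(buffer=audio_array, frame_rate=in_sample_rate)
```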
Now we're moving on to all the voice pipeline stuff, i.e. the Agents SDK side of things. We will obviously need an OpenAI API key at this point, so I'm going to run this, and you'll need to get your OpenAI API key from platform.openai.com. Once you have your API key, paste it into the prompt at the top, or wherever the little dialog box pops up for you, and press enter.
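One common pattern for doing this in a notebook, sketched here with getpass (the prompt text is illustrative):

```python
import os
from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook;
# create one at platform.openai.com
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "OpenAI API key: "
)
```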
Okay, now we have our API key in there, and what we're going to do next is initialize an agent through Agents SDK. There's nothing unique or new here; all we are doing is initializing a normal agent. The only thing that is slightly different is that I'm specifying to the assistant in the instructions, i.e. the developer or system prompt, that the user is speaking to it via a voice interface. The reason I put that in there is that if I do not, and I say to the agent (as I do all the time) "can you hear me", the agent will typically respond with "no, I cannot hear you, but I can read what you are typing". As far as the LLM is concerned, it is reading text; it doesn't realize that text is coming from a voice interface, that there's a speech-to-text step happening in the middle, and it also doesn't realize that what it writes is going to become speech through a text-to-speech component afterwards. So it's important to mention that if you are building voice agents. We run that, and then we pass our agent into this SingleAgentVoiceWorkflow.
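Roughly, that setup looks like the following; the instruction wording here is my paraphrase, not the course's exact prompt:

```python
from agents import Agent
from agents.voice import SingleAgentVoiceWorkflow

# A perfectly normal Agents SDK agent -- the only voice-specific part
# is telling it, in its instructions, that a speech-to-text step sits
# in front of it and a text-to-speech step sits after it
agent = Agent(
    name="Assistant",
    instructions=(
        "You are a helpful assistant. The user is speaking to you via a "
        "voice interface: their speech is transcribed to text for you, "
        "and your written reply is spoken back to them."
    ),
    model="gpt-4.1-nano",
)

workflow = SingleAgentVoiceWorkflow(agent)
```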
Great, so we have that, and now we will initialize our pipeline configuration. The pipeline configuration essentially defines the settings you want to use for your voice pipeline. The voice pipeline, at a very basic level, consists of three main components (beyond all the data transformation that happens first): a speech-to-text component that converts your spoken audio into text; that text then gets passed to an LLM, in this case GPT-4.1 nano; the LLM then generates text, which goes to a text-to-speech model, and that spoken audio from the LLM is passed back to us. We can set various parameters here, and one of those is the text-to-speech model settings. In there you can essentially tell the text-to-speech model what the spoken audio should be like; it's effectively the system prompt for the text-to-speech model. The one I'm using is actually from one of the examples that OpenAI provided, either in one of their blogs or their docs, I don't quite remember, but it's a nice example of how you can use this: we give it a personality and a tone, we specify that it should be clear, articulate, and steady, and we give it a tempo and an emotion. We pass all of that into the VoicePipelineConfig object. Once we have that voice pipeline configuration and our agent workflow, we use both together to initialize our VoicePipeline object, and this is what we're actually going to be using: the pipeline that takes our voice, converts it into text, passes it to the LLM, and so on.
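A sketch of that configuration and pipeline setup; the style prompt below is a paraphrase of the kind of example OpenAI gives, not their exact text:

```python
from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig

# Style instructions for the text-to-speech model -- effectively its
# system prompt for how the audio should sound
tts_settings = TTSModelSettings(
    instructions=(
        "Personality: upbeat, friendly assistant. "
        "Tone: clear, articulate, and steady. "
        "Tempo: natural pace, with brief pauses. "
        "Emotion: warm and helpful."
    )
)

config = VoicePipelineConfig(tts_settings=tts_settings)

# The pipeline ties the config to the agent workflow:
# speech-to-text -> agent (LLM) -> text-to-speech
pipeline = VoicePipeline(workflow=workflow, config=config)
```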
So let's try passing the audio that we recorded before to this pipeline and see what happens. There are various ways we can pass audio, and I'll show you more in a moment. pipeline.run is an async method, so we do need to await it. We pass our AudioInput into pipeline.run, await it, and then iterate through all of the streamed events we receive, using an async for loop over the stream coming out of the result. There are various event types that can be sent, but the only one containing audio that we want to return to the user is VoiceStreamEventAudio; that's the only one that matters in this scenario. When we see it, we append the event data to our response chunks, which starts as an empty list. Similar to before, where we had those chunks of audio that we concatenated, these are chunks of audio and we need to concatenate them to create a single audio array. Then we can go ahead and play that.
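Putting that together, the run-and-stream step looks roughly like this, inside an async context such as a notebook cell:

```python
# pipeline.run is async, so we await it (fine at the top level of a
# notebook cell)
result = await pipeline.run(audio_input)

response_chunks = []

# The result streams several event types; only the audio events carry
# sound that we want to play back to the user
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        response_chunks.append(event.data)

# Same trick as before: concatenate the chunks into one audio array
response_audio = np.concatenate(response_chunks)
```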
Now, the one thing we need to be careful with here is the sample rate: what OpenAI generates is not the 48 kilohertz we saw from both my input and output devices. OpenAI has its own sample rate, and that is 24 kilohertz, so we specify that here. When we're receiving audio from them, that is the rate we need to pass in order to play the audio at the correct speed.
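So playback uses 24,000 rather than the device rate; a sketch:

```python
# OpenAI's text-to-speech output is 24 kHz, regardless of the 48 kHz
# our devices use, so we play it back at the rate it was generated at
sd.play(response_audio, samplerate=24000)
sd.wait()
```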
If we don't, I can show you, it sounds pretty weird; it's just too fast. So let me turn up the audio again and we can play this. [Playback, too fast: "You mentioned that you see something at the top here, could you tell me what it says or describe it?"] Okay, so yes, it's a little fast. So we turn this down to 24 kilohertz, which is the correct sample rate. [Playback: "Got it. At the top it says press enter to confirm. How can I help you with this?"] This is just responding to when I was talking earlier, so it doesn't make a ton of sense, but let's go ahead and see how we can talk to this in a more responsive way.
I've essentially wrapped all the logic that we've just been through into this async method here, and it all happens within a while loop. What the while loop is doing is saying: we have that input function from before, and when you press enter, it starts recording; then you press enter again to stop recording, and the pipeline responds to us. When we're done and want to finish the conversation, we press q and enter, and it will exit. I'll show you how we do this. Let's go ahead and try it.
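A condensed sketch of that wrapper, reusing the pieces from above; this is my compression of the loop described here, not the course's exact code:

```python
from agents.voice import AudioInput

async def conversation():
    while True:
        # Enter starts a turn; q + Enter ends the conversation
        if input("Press Enter to speak (q + Enter to quit): ").strip().lower() == "q":
            break

        # Record until Enter is pressed again
        chunks = []
        with sd.InputStream(samplerate=in_sample_rate, channels=1, dtype="int16",
                            callback=lambda indata, frames, time, status:
                                chunks.append(indata.copy())):
            input("Recording... press Enter to stop. ")
        if not chunks:
            continue

        # Run the pipeline on this turn's audio and collect the reply
        result = await pipeline.run(
            AudioInput(buffer=np.concatenate(chunks).squeeze(),
                       frame_rate=in_sample_rate)
        )
        response = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response.append(event.data)

        # Speak the reply back at OpenAI's 24 kHz output rate
        sd.play(np.concatenate(response), samplerate=24000)
        sd.wait()

await conversation()
```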
So now, you see I need to press enter to speak, so I press enter once. "Hello, can you hear me?" I press enter again. [Agent: "Yes, I can hear you in the sense that I can read your messages. How can I assist you today?"] So you can actually see the weird response it will give you if you're not prompting it well enough; apparently I'm not prompting it well enough, I thought I was, but anyway. So I'm going to speak again: "Hi, I'm actually talking to you through a voice interface. I believe that the voice is being transformed into text, which is what you're reading, so I think you can hear me." [Agent: "I understand you're speaking to me face to face and that your voice communication might feel more formal. While I can't hear voice directly, I process your typed messages. If you'd like, you can share more details or ask questions, and I'll do my best to help."] Okay, and I'm just going to press q to exit.
That response seems a bit overly enthusiastic to me, and the prompting, especially around "you're actually in a voice interface", needs a bit of work, which is fine; we can of course do that. But the main part of this, where we speak, the agent reads what we said, and it responds back to us, is working. And note that the LLM we're using here is GPT-4.1 nano, which, as far as I'm aware, is a very small model and not particularly performant, which might also explain the lack of understanding of my system prompt. But you can see how quickly we built a voice interface that we can speak to an LLM through, which I think is really cool.
Going forwards, I think the majority of the time that normal people, and maybe even developers, a very broad range of people, interact with AI, it's either going to be AI that you don't even notice is there, or AI that you are actually speaking to. I don't think people are necessarily going to be typing to AI all the time, because voice is just a much more natural way of communicating, and it allows for a more fluid conversation; I think that's the direction we're going. Not for everything, of course: for code assistance, if I want code help, maybe sometimes I'll talk to an agent, but the majority of the time I'm probably going to be typing, and I don't expect that to change necessarily. But there are just so many normal, everyday use cases, where you want to know something quickly and want the agent to web search for you, or you just want to chat through something, or you want to practice your Italian, for example, which I do all the time, where a voice interface is really good to use, and there are many other cases like that. So I think this is very exciting, and this is a really quick way to get started with voice agents, which I think is great. In the future I'm definitely going to cover more on voice, and as I mentioned, this specific notebook is part of a broader course on Agents SDK, so I'm obviously going to be talking more about Agents SDK very soon as well. For now, I'll leave it there. I hope all this has been useful and interesting, so thank you very much for watching, and I'll see you again in the next one. Bye!