AI Voice Assistants with OpenAI's Agents SDK | Full Tutorial + Code

Chapters
0:00 AI Voice Assistants
0:58 Getting the Code
2:19 Handling Audio in Python
6:56 Agents SDK Voice Pipeline
11:02 Speaking to the Agent
13:38 Chat with Voice Agent
15:31 Voice Agents Conclusion
Today we are going to introduce how to use voice within Agents SDK. Agents SDK, if you don't already know, is OpenAI's new AI agent framework, and when I talk about voice I am talking about building a voice interface through which you communicate with your agents. Now, voice as a topic is very broad and there are various ways of implementing it. One of those is through Agents SDK. Agents SDK provides various voice features that help us build a voice agent quite easily, and in my opinion it acts more as an introduction to voice agents before you potentially step into something more advanced such as LiveKit or Daily. That being said, you can already do a lot with just voice in Agents SDK, so I definitely recommend starting with this.
Now, to get started, we're going to be using one of the chapters from an upcoming course that we're putting together, which is focused on Agents SDK. One of these chapters is on voice, and this is an early but mostly finished draft. You can either install everything through a pip install, or, the way that I would recommend, go ahead and git clone the repo and navigate into it. First, make sure you have uv installed. Then run uv venv and set a Python version to use. It doesn't have to be 3.12.7, but that's what I would recommend, just to align your environment with the one I am using here. Then activate that environment, and you'll see that I'm now on Python 3.12.7 in the agents-sdk-course environment, which is the environment name. Then run uv sync to ensure that you have all of the prerequisites, at their latest versions, installed. Once you've done that, navigate to your code editor, whatever you are using, open the repo,
and go to the voice chapter. Now, I'm going to take you through everything we need to know to get started with voice in Python. The very first thing is actually not specific to Agents SDK: we first need to understand how to handle audio in Python. How do we record audio? How do we play it back? So we'll jump into that first. In the notebook, make sure you have the correct kernel selected; it should already preload, although you may have to go into "select a kernel" and find the .venv/bin Python. Then we're going to take a look at the sounddevice library. This is the library we are using to handle audio in Python. Before recording or playing audio, we need to get the sample rate of our input and output devices. We do that with sounddevice's query devices function, looking for the input device and the output device. If we run this we should see something like my setup, especially if you're on a MacBook: I can see my microphone, I can see my speakers, and the sample rate is listed for each, so we pull those in as the input sample rate and output sample rate. Then we create this input stream, which is what we stream our audio input into, and recording continues until we press enter, because we've added the built-in input function here. So let's go ahead and try that.
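To make that concrete, here's a minimal sketch of the recording logic, assuming the sounddevice and numpy packages the notebook uses (the blocksize and dtype are illustrative, not necessarily the course's exact values):

```python
import numpy as np
import sounddevice as sd

# Ask the default input/output devices for their native sample rates
in_sample_rate = int(sd.query_devices(kind="input")["default_samplerate"])
out_sample_rate = int(sd.query_devices(kind="output")["default_samplerate"])

recorded_chunks = []

# sounddevice calls this for every block of frames while the stream is
# open; we copy each block into a plain Python list
def on_audio(indata, frames, time, status):
    recorded_chunks.append(indata.copy())

# Keep the stream (and so the recording) open until Enter is pressed
with sd.InputStream(samplerate=in_sample_rate, channels=1, dtype="int16",
                    blocksize=512, callback=on_audio):
    input("Press Enter to confirm your input, or Escape to cancel... ")
```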
Okay, so you can see at the top here it says "press enter to confirm your input or escape to cancel". So right now, as I'm talking, this is recording. I'm going to press enter, and I can come down here and see that we have these recorded chunks. What this is doing is recording chunks of audio; each of those is a NumPy array, and each chunk is 512 values that represent a very small slice of time in our audio. Right now we have 1401 of these chunks, and inside each one is a 512-element vector. The other dimension you see here is the number of audio channels. My microphone records in mono, not stereo, so there is just one audio channel; if it were stereo there would be two audio channels, and this dimension would be two.
What we need to do now is concatenate all of these, which creates a single audio array, and then we can play that back. So let me turn up the audio and we can play this back and hear what's in there. [Playback: "Okay so you see at the top here it says press enter to confirm your input or escape to cancel. Okay so actually right now as I'm talking this is recording. So I'm going to press enter."] Okay, so that was the recording from before. Now, you may notice that this cell ran straight away; it didn't wait for the audio to finish playing. What we can do is call sd.wait, which makes the cell complete only once the audio completes. Now let's come down and see what this audio looks like. It's just a waveform, and we can see it here: a pretty typical audio waveform. So we have that.
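As a sketch, the concatenation and playback step with sounddevice might look like this, continuing from the recording snippet above:

```python
# Stitch the per-block (512, 1) arrays into one flat mono buffer
audio_array = np.concatenate(recorded_chunks).squeeze()

# Play it back at the rate it was recorded at; sd.wait() blocks the
# cell until playback finishes instead of returning immediately
sd.play(audio_array, samplerate=in_sample_rate)
sd.wait()
```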
Now, this is our audio, and what we need to do next is transform it into the AudioInput type, which is specific to Agents SDK. This is the SDK's own audio object where we'll be storing that audio. You can see it's just an array; I don't think there's anything beyond what we already had with the NumPy array. It's just a type the SDK uses to handle that array. So we have that.
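A minimal sketch of that wrapping step, assuming the openai-agents package with its voice extras installed (pip install "openai-agents[voice]"):

```python
from agents.voice import AudioInput

# The SDK's audio container -- essentially a typed wrapper around the
# NumPy array we already have. Declaring our recording's frame rate is
# my assumption here; the SDK's default is OpenAI's own 24 kHz
audio_input = AudioInput(buffer=audio_array, frame_rate=in_sample_rate)
```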
Now we're moving on to all the voice pipeline stuff, i.e. the Agents SDK side of things. We will obviously need an OpenAI API key at this point, so I'm going to run this, and you'll need to get your OpenAI API key from platform.openai.com. Once you have your API key, paste it into the prompt at the top, or wherever the little dialog box pops up for you, and press enter.
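One common pattern for doing this in a notebook, sketched here with getpass (the prompt text is illustrative):

```python
import os
from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook;
# create one at platform.openai.com
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "OpenAI API key: "
)
```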
Okay, now we have our API key in there, and what we're going to do next is initialize an agent through Agents SDK. There's nothing unique or new here; all we are doing is initializing a normal agent. The only thing that is slightly different is that I'm specifying to the assistant in the instructions, i.e. the developer or system prompt, that the user is speaking to it via a voice interface. The reason I put that in there is that if I do not, and I say to the agent (as I do all the time) "can you hear me", the agent will typically respond with "no, I cannot hear you, but I can read what you are typing". As far as the LLM is concerned, it is reading text; it doesn't realize that text is coming from a voice interface, that there's a speech-to-text step happening in the middle, and it also doesn't realize that what it writes is going to become speech through a text-to-speech component afterwards. So it's important to mention that if you are building voice agents. We run that, and then we pass our agent into this SingleAgentVoiceWorkflow.
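Roughly, that setup looks like the following; the instruction wording here is my paraphrase, not the course's exact prompt:

```python
from agents import Agent
from agents.voice import SingleAgentVoiceWorkflow

# A perfectly normal Agents SDK agent -- the only voice-specific part
# is telling it, in its instructions, that a speech-to-text step sits
# in front of it and a text-to-speech step sits after it
agent = Agent(
    name="Assistant",
    instructions=(
        "You are a helpful assistant. The user is speaking to you via a "
        "voice interface: their speech is transcribed to text for you, "
        "and your written reply is spoken back to them."
    ),
    model="gpt-4.1-nano",
)

workflow = SingleAgentVoiceWorkflow(agent)
```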
Great, so we have that, and now we will initialize our pipeline configuration. The pipeline configuration essentially defines the settings you want to use for your voice pipeline. The voice pipeline, at a very basic level, consists of three main components (beyond all the data transformation that happens first): a speech-to-text component that converts your spoken audio into text; that text then gets passed to an LLM, in this case GPT-4.1 nano; the LLM then generates text, which goes to a text-to-speech model, and that spoken audio from the LLM is passed back to us. We can set various parameters here, and one of those is the text-to-speech model settings. In there you can essentially tell the text-to-speech model what the spoken audio should be like; it's effectively the system prompt for the text-to-speech model. The one I'm using is actually from one of the examples that OpenAI provided, either in one of their blogs or their docs, I don't quite remember, but it's a nice example of how you can use this: we give it a personality and a tone, we specify that it should be clear, articulate, and steady, and we give it a tempo and an emotion. We pass all of that into the VoicePipelineConfig object. Once we have that voice pipeline configuration and our agent workflow, we use both together to initialize our VoicePipeline object, and this is what we're actually going to be using: the pipeline that takes our voice, converts it into text, passes it to the LLM, and so on.
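A sketch of that configuration and pipeline setup; the style prompt below is a paraphrase of the kind of example OpenAI gives, not their exact text:

```python
from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig

# Style instructions for the text-to-speech model -- effectively its
# system prompt for how the audio should sound
tts_settings = TTSModelSettings(
    instructions=(
        "Personality: upbeat, friendly assistant. "
        "Tone: clear, articulate, and steady. "
        "Tempo: natural pace, with brief pauses. "
        "Emotion: warm and helpful."
    )
)

config = VoicePipelineConfig(tts_settings=tts_settings)

# The pipeline ties the config to the agent workflow:
# speech-to-text -> agent (LLM) -> text-to-speech
pipeline = VoicePipeline(workflow=workflow, config=config)
```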
So let's try passing the audio that we recorded before to this pipeline and see what happens. There are various ways we can pass audio, and I'll show you more in a moment. pipeline.run is an async method, so we do need to await it. We pass our AudioInput into pipeline.run, await it, and then iterate through all of the streamed events we receive, using an async for loop over the stream coming out of the result. There are various event types that can be sent, but the only one containing audio that we want to return to the user is VoiceStreamEventAudio; that's the only one that matters in this scenario. When we see it, we append the event data to our response chunks, which starts as an empty list. Similar to before, where we had those chunks of audio that we concatenated, these are chunks of audio and we need to concatenate them to create a single audio array. Then we can go ahead and play that.
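Putting that together, the run-and-stream step looks roughly like this, inside an async context such as a notebook cell:

```python
# pipeline.run is async, so we await it (fine at the top level of a
# notebook cell)
result = await pipeline.run(audio_input)

response_chunks = []

# The result streams several event types; only the audio events carry
# sound that we want to play back to the user
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        response_chunks.append(event.data)

# Same trick as before: concatenate the chunks into one audio array
response_audio = np.concatenate(response_chunks)
```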
Now, the one thing we need to be careful with here is the sample rate: what OpenAI generates is not the 48 kilohertz we saw from both my input and output devices. OpenAI has its own sample rate, and that is 24 kilohertz, so we specify that here. When we're receiving audio from them, that is the rate we need to pass in order to play the audio at the correct speed.
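So playback uses 24,000 rather than the device rate; a sketch:

```python
# OpenAI's text-to-speech output is 24 kHz, regardless of the 48 kHz
# our devices use, so we play it back at the rate it was generated at
sd.play(response_audio, samplerate=24000)
sd.wait()
```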
If we don't, I can show you, it sounds pretty weird; it's just too fast. So let me turn up the audio again and we can play this. [Playback, too fast: "You mentioned that you see something at the top here, could you tell me what it says or describe it?"] Okay, so yes, it's a little fast. So we turn this down to 24 kilohertz, which is the correct sample rate. [Playback: "Got it. At the top it says press enter to confirm. How can I help you with this?"] This is just responding to when I was talking earlier, so it doesn't make a ton of sense, but let's go ahead and see how we can talk to this in a more responsive way.
I've essentially wrapped all the logic that we've just been through into this async method here, and it all happens within a while loop. What the while loop is doing is saying: we have that input function from before, and when you press enter, it starts recording; then you press enter again to stop recording, and the pipeline responds to us. When we're done and want to finish the conversation, we press q and enter, and it will exit. I'll show you how we do this. Let's go ahead and try it.
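A condensed sketch of that wrapper, reusing the pieces from above; this is my compression of the loop described here, not the course's exact code:

```python
from agents.voice import AudioInput

async def conversation():
    while True:
        # Enter starts a turn; q + Enter ends the conversation
        if input("Press Enter to speak (q + Enter to quit): ").strip().lower() == "q":
            break

        # Record until Enter is pressed again
        chunks = []
        with sd.InputStream(samplerate=in_sample_rate, channels=1, dtype="int16",
                            callback=lambda indata, frames, time, status:
                                chunks.append(indata.copy())):
            input("Recording... press Enter to stop. ")
        if not chunks:
            continue

        # Run the pipeline on this turn's audio and collect the reply
        result = await pipeline.run(
            AudioInput(buffer=np.concatenate(chunks).squeeze(),
                       frame_rate=in_sample_rate)
        )
        response = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response.append(event.data)

        # Speak the reply back at OpenAI's 24 kHz output rate
        sd.play(np.concatenate(response), samplerate=24000)
        sd.wait()

await conversation()
```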
So now, you see I need to press enter to speak, so I press enter once. "Hello, can you hear me?" I press enter again. [Agent: "Yes, I can hear you in the sense that I can read your messages. How can I assist you today?"] So you can actually see the weird response it will give you if you're not prompting it well enough; apparently I'm not prompting it well enough, I thought I was, but anyway. So I'm going to speak again: "Hi, I'm actually talking to you through a voice interface. I believe that the voice is being transformed into text, which is what you're reading, so I think you can hear me." [Agent: "I understand you're speaking to me face to face and that your voice communication might feel more formal. While I can't hear voice directly, I process your typed messages. If you'd like, you can share more details or ask questions, and I'll do my best to help."] Okay, and I'm just going to press q to exit.
That response seems a bit overly enthusiastic to me, and the prompting, especially around "you're actually in a voice interface", needs a bit of work, which is fine; we can of course do that. But the main part of this, where we speak, the agent reads what we said, and it responds back to us, is working. And note that the LLM we're using here is GPT-4.1 nano, which, as far as I'm aware, is a very small model and not particularly performant, which might also explain the lack of understanding of my system prompt. But you can see how quickly we built a voice interface that we can speak to an LLM through, which I think is really cool.
Going forwards, I think the majority of the time that normal people, and maybe even developers, a very broad range of people, interact with AI, it's either going to be AI that you don't even notice is there, or AI that you are actually speaking to. I don't think people are necessarily going to be typing to AI all the time, because voice is just a much more natural way of communicating, and it allows for a more fluid conversation; I think that's the direction we're going. Not for everything, of course: for code assistance, if I want code help, maybe sometimes I'll talk to an agent, but the majority of the time I'm probably going to be typing, and I don't expect that to change necessarily. But there are just so many normal, everyday use cases, where you want to know something quickly and want the agent to web search for you, or you just want to chat through something, or you want to practice your Italian, for example, which I do all the time, where a voice interface is really good to use, and there are many other cases like that. So I think this is very exciting, and this is a really quick way to get started with voice agents, which I think is great. In the future I'm definitely going to cover more on voice, and as I mentioned, this specific notebook is part of a broader course on Agents SDK, so I'm obviously going to be talking more about Agents SDK very soon as well. For now, I'll leave it there. I hope all this has been useful and interesting, so thank you very much for watching, and I'll see you again in the next one. Bye!