Today we are going to introduce how to use voice within the Agents SDK. The Agents SDK, if you don't already know, is OpenAI's new agent framework, and when I talk about voice I am speaking about building a voice interface through which you communicate with your agents. Now, voice as a topic is very broad and there are various ways of implementing it.
One of those is through the Agents SDK. The Agents SDK provides various voice features that help us build a voice agent quite easily, and in my opinion it acts more as an introduction to voice agents before you potentially step into something more advanced such as LiveKit or Daily.
That being said, you can already do a lot with just voice in the Agents SDK, so I definitely recommend starting with this. To get started we're going to be using one of the chapters from an upcoming course that we're putting together, which is focused on the Agents SDK. One of these chapters is on voice, and this is an earlier, mostly finished draft.
The way that we would use this is either to install everything through a pip install, like so, or, the way that I would recommend, to go ahead and git clone. So I'll git clone that repo, navigate into it, and the first thing you'll want to do is make sure you have uv installed.
Then you want to run uv venv and set a Python version to use. It doesn't have to be 3.12.7, but that's what I would recommend, just to align your environment with the one I am using here. Then I would activate that environment, and you can see now that I'm on Python 3.12.7 in the agents SDK course environment, which is the environment name, and I would run uv sync just to ensure that you have all of the prerequisites, or the latest prerequisites, installed.
Once you've done that, you can navigate to your code editor, whatever you are using, open the repo, and go to the voice chapter. I'm going to take you through everything we need to know to get started with voice in Python. The very first thing is actually not specific to the Agents SDK.
We first need to understand how we handle voice, or rather audio, in Python. How do we record audio? How do we play it back? So we're going to jump into that first. In our notebook, I'm going to make sure I have the correct kernel selected.
It should already preload here, although you may have to go into "Select Kernel" and then find the .venv/bin/python interpreter. Then what we're going to do is take a look at the sounddevice library. This is the library that we are using to handle audio in Python.
What we need to do before recording or playing audio is get the sample rate of our input and output devices. We do that with sounddevice's query_devices, and then we want to look for the input device and the output device. If we run this we should see something like this, especially if you're on a MacBook: I can see that I have my microphone and I can see that I have my speakers.
Okay, and then the sample rate we're going to see is down here, so we're going to pull that in. Yep, that's all I'm doing here: pulling in the in sample rate and the out sample rate. A minimal sketch of that is below.
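Roughly, that cell does something like this, assuming the default input and output devices are the ones you want:

```python
import sounddevice as sd

# List every audio device the system can see (microphones, speakers, etc.)
print(sd.query_devices())

# Pull the default sample rates of the default input and output devices
in_sample_rate = sd.query_devices(kind="input")["default_samplerate"]
out_sample_rate = sd.query_devices(kind="output")["default_samplerate"]
print(in_sample_rate, out_sample_rate)
```

On my MacBook both of these come back as 48 kHz, which will matter later when we compare it against the sample rate of the audio OpenAI sends back.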
Then what we're doing is creating this input stream, which is the object we're streaming our audio input into. That is going to continue until we press enter, because we've added this input function here. So let's go ahead and try that. Okay, so you can see at the top here it says "press enter to confirm your input or escape to cancel".
So actually, right now, as I'm talking, this is recording. I'm going to press enter, and I can come down here and see that we have these recorded chunks. What this is doing is recording these chunks of audio; each of those is a NumPy array, and each one of those chunks is 512 values that represent a very small amount of time in our audio.
So right now we have 1401 of these chunks, and inside each one of those is a 512-element vector. The one that you see here is the number of audio channels that we have: my microphone is recording in mono, not stereo.
Because it's mono there is just one audio channel; if it was stereo there would be two audio channels, so this dimension would be two. What we need to do is concatenate all of these, which will create a single audio array, and then we can actually play that back. A rough sketch of that recording step is below.
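Here's a minimal sketch of that recording step, assuming the in_sample_rate from above and a 512-sample block size; the exact parameters in the notebook may differ:

```python
import numpy as np
import sounddevice as sd

recorded_chunks = []

def on_audio(indata, frames, time, status):
    # indata has shape (frames, channels); copy it, since sounddevice reuses the buffer
    recorded_chunks.append(indata.copy())

# Stream microphone audio into the callback until Enter is pressed
stream = sd.InputStream(
    samplerate=int(in_sample_rate),
    channels=1,       # mono microphone, so one channel
    blocksize=512,    # each chunk arrives as a (512, 1) NumPy array
    callback=on_audio,
)
with stream:
    input("Press enter to stop recording...")

# Join the chunks into one (n_samples, 1) audio array
audio = np.concatenate(recorded_chunks)
print(audio.shape)
```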
So let me just open the audio and we can play it back and see what is in there: "Okay, so you can see at the top here it says press enter to confirm your input or escape to cancel. So actually right now, as I'm talking, this is recording."
"So I'm going to press enter." Okay, so that was the recording from before. Now, you can see that this cell ran straight away; it didn't wait for the audio to finish playing. What we can do here is write sd.wait(), and then the cell will only complete once the audio completes.
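The playback itself is just a couple of lines; a sketch, assuming the audio array and sample rate from the recording step:

```python
import sounddevice as sd

# Play the recorded audio at the rate it was captured, then block until playback finishes
sd.play(audio, samplerate=int(in_sample_rate))
sd.wait()
```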
Now let's come down and see what this audio looks like. It's just going to be a waveform, which we can plot and see here; a pretty typical audio waveform. So we have that.
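One way to plot that waveform (the notebook may do it slightly differently) is with matplotlib:

```python
import matplotlib.pyplot as plt

# x-axis is the sample index, y-axis is the amplitude
plt.plot(audio)
plt.show()
```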
Now this is our audio, and what we need to do next is transform it into the AudioInput type, which is specific to the Agents SDK. This is the SDK's own audio object, where we'll be storing that audio. You can see it's just an array; I don't think there's anything beyond what we already did with the NumPy array. It's just a type that is now used to handle that array.
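A minimal sketch of that wrapping step; the extra fields I mention in the comment are from memory of the SDK, so treat them as assumptions:

```python
from agents.voice import AudioInput

# Wrap the recorded NumPy array in the SDK's own audio type.
# If I recall correctly, AudioInput also has fields for the audio's frame rate,
# sample width, and channel count, which you may need to set if your recording
# doesn't match the SDK's defaults.
audio_input = AudioInput(buffer=audio)
```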
So we have that. Now we're moving on to all the voice pipeline stuff, so all the Agents SDK stuff. Obviously we will need an OpenAI API key at this point, so I'm going to run this, and you'll need to get your API key from platform.openai.com. Once you have your API key, you just paste it into the top here, or wherever the little dialog box pops up for you, and press enter.
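Something like this is all that cell is doing; getpass is what gives you the little hidden input box:

```python
import os
from getpass import getpass

# Prompt for the key if it isn't already set in the environment
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
```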
Okay, now we have our API key in there, and what we're going to do next is initialize an agent through the Agents SDK. There's nothing unique or new here; all we are doing is initializing a normal agent. The only thing that is slightly different is that I'm specifying to the assistant in the instructions, so the developer or system prompt, that the user is speaking to it via a voice interface.

The reason I put that in there is because if I do not, and I say to the agent "can you hear me?" (which I do all the time), the agent will typically respond with "no, I cannot hear you, but I can read what you are typing". As far as the LLM is concerned, it is reading text; it doesn't realize that the text is coming from a voice interface, that there is a speech-to-text step happening in the middle, or that what it writes is going to become speech through a text-to-speech component afterwards. So it's important to mention this if you are building voice agents. A rough sketch of that agent definition is below.
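As a sketch, the agent definition looks something like this; the agent name and the exact wording of the instructions are mine, not necessarily the notebook's:

```python
from agents import Agent

agent = Agent(
    name="Voice Assistant",   # placeholder name
    model="gpt-4.1-nano",
    instructions=(
        "You are a helpful assistant. The user is speaking to you via a voice "
        "interface: their speech is transcribed to text before you see it, and "
        "your reply will be converted back to speech for them."
    ),
)
```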
So we run that, and then we pass our agent into this SingleAgentVoiceWorkflow. Great, so we have that, and now we will initialize our pipeline configuration. The pipeline configuration is essentially saying, okay, what are the settings, what configuration do you want to use for your voice pipeline?
The voice pipeline, at a very basic level, is three main components. There's obviously all the data transformation stuff that happens first, then you get to a speech-to-text component, which converts your spoken audio into text.
That text then gets passed to an LLM; in this case we've specified that it is GPT-4.1 nano. The LLM then generates text, which goes to a text-to-speech model, and that spoken audio is passed back to us. We can set various parameters in here, and one of those is the text-to-speech model settings. In here you can essentially tell the text-to-speech model what the spoken audio should be like; what its system prompt is, essentially, as a text-to-speech model. This is actually from one of the examples that OpenAI provided, either in one of their blogs or their docs, I don't quite remember, but it's a nice example of how you can use this: we give it a personality, a tone (we specify that it should be clear, articulate, and steady), a tempo, and an emotion. We pass all of that into this voice pipeline config object, and once we have that voice pipeline config, or configuration, and our agent workflow, we use both of those together to initialize our VoicePipeline object. This is what we're actually going to be using: this is the pipeline that takes our voice, converts it into text, passes it to the LLM, and so on.

So let's try passing the audio that we recorded before to this pipeline and just see what happens. There are various ways that we can pass audio, and I'm going to show you some more in a moment. pipeline.run is an async method, so we do need to await it: we pass our audio input into pipeline.run, we await it, and then we need to iterate through all of the streamed chunks, or streamed events, that we are receiving. So we simply do an async for loop here, looking at the stream that is coming out of this result. There are various event types that can be sent, and the only one that contains audio we want to return to the user is VoiceStreamEventAudio; that's the only one that's important to us in this scenario. When we see it, we append the event data to our response chunks, which is an empty list here. This is similar to before, where we had those chunks of audio that we then concatenated: these are chunks of audio, and we need to concatenate them to create a single audio file. Then we can go ahead and play that.

The one thing that we do need to be careful with here is the sample rate. The audio that OpenAI generates, or outputs, is not at the 48 kilohertz that we saw from both my input and output devices; OpenAI has its own sample rate, and that is 24 kilohertz, so we just specify that here. When we're receiving audio from them, that is what we need to pass in here in order to play the audio at the correct speed. If we don't, I can show you, it's pretty weird, or really just fast. So let me turn up the audio again and play this: "You mentioned that you see something at the top here. Could you tell me what it says or describe it? I'll do my best to assist you." Okay, so yeah, it's a little fast, so we turn this down to 24 kilohertz, which is the correct sample rate: "Got it. At the top it says press enter to confirm. How can I help you with this?" This is just responding to when I was talking earlier, so it doesn't really make a ton of sense. A rough end-to-end sketch of this single-turn step is below.
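Putting that whole step together, a hedged sketch might look like the following. The config and settings field names (tts_settings, instructions, the event type string) are from memory of the SDK, so double-check them against the docs, and the TTS instructions here are only a paraphrase of the OpenAI example, not the notebook's exact prompt:

```python
import numpy as np
import sounddevice as sd
from agents.voice import (
    SingleAgentVoiceWorkflow,
    TTSModelSettings,
    VoicePipeline,
    VoicePipelineConfig,
)

OPENAI_SAMPLE_RATE = 24_000  # OpenAI's output sample rate, not our devices' 48 kHz

# Wrap the agent in a single-agent workflow and tell the TTS model how to speak
workflow = SingleAgentVoiceWorkflow(agent)
config = VoicePipelineConfig(
    tts_settings=TTSModelSettings(
        instructions=(
            "Personality: upbeat, friendly assistant. "
            "Tone: clear, articulate, and steady. "
            "Tempo: measured. Emotion: warm."
        )
    )
)
pipeline = VoicePipeline(workflow=workflow, config=config)

# Run the pipeline on our AudioInput (top-level await works in a notebook)
result = await pipeline.run(audio_input)

# Collect only the audio events from the streamed result
response_chunks = []
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        response_chunks.append(event.data)

# Concatenate the chunks and play them back at OpenAI's sample rate
response_audio = np.concatenate(response_chunks)
sd.play(response_audio, samplerate=OPENAI_SAMPLE_RATE)
sd.wait()
```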
So let's go ahead and see how we can actually talk to this in a more responsive way. I've essentially wrapped all the logic that we've just been through into this async method here, and it all happens within a while loop. Essentially, what this while loop does is use that input function from before: when you press enter it starts recording, then you press enter again to stop recording, and the pipeline responds to us. When we're done and want to finish the conversation, we press q and enter, and it will exit; I'll show you how we do this, and there's a rough sketch of the loop at the end of this section.

So let's go ahead and try this. You can see I need to press enter to speak, so I press enter once: "Hello, can you hear me?" I press enter again. "Yes, I can hear you in the sense that I can read your messages. How can I assist you today?" So you can actually see the weird response it will give you if you're not prompting it well enough; apparently I'm not prompting it well enough, although I thought I was. Anyway, I'm going to speak again: "Hi, I'm actually talking to you through a voice interface. I believe that the voice is being transformed into text, which is what you're reading, so I think you can hear me." "I understand you're speaking to me face to face and that your voice communication might feel more formal. While I can't hear voice directly, I process your typed messages. If you'd like, you can share more details or ask questions and I'll do my best to help." Okay, and I'm just going to press q to exit.

It seems a bit overly enthusiastic to me, and the prompting, especially around "okay, you are actually in a voice interface", needs a bit of work, which is fine; we can of course do that. But the main part of this, which is that we are speaking and the agent is reading what we say and responding back to us, is working. And this is with the model we're using here, GPT-4.1 nano, which as far as I'm aware is a very small model and not particularly performant, which might also explain it not following my system prompt. But you can see that very quickly we built this voice interface that we can speak to an LLM through, which I think is really cool.

Going forward, I think the majority of the time that normal people, and maybe even developers, really a very broad range of people, interact with AI, it's either going to be AI that you don't even notice is there, or it's going to be AI that you are actually speaking to. I don't think people are necessarily going to be typing to AI all the time, because voice is a much more natural way of communicating and it allows for a more fluid conversation, which I think is the direction we're going. Not for everything, of course: for code assistance, maybe sometimes I'll talk to an agent, but the majority of the time I'm probably going to be typing, and I don't expect that to change necessarily. But there are so many normal use cases where you just want to know something quickly and have the agent web search for you, or you want to chat to it quickly and talk something through, or practice your Italian, for example, which I do all the time; a voice interface is really good for that, and there are many other cases like it. So I think this is very exciting, and it's a really quick way to get started with voice agents, which I think is great. In the future I'm definitely going to cover more on voice, and as I mentioned, this specific notebook is part of a broader course on the Agents SDK.
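For reference, here's roughly what that interactive loop looks like, as a minimal sketch that reuses the pipeline and in_sample_rate from earlier; the notebook's actual version may differ in the details:

```python
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput

async def voice_chat():
    while True:
        # Press enter to start a turn, or q + enter to finish the conversation
        if input("Press enter to speak (or q + enter to quit): ").strip().lower() == "q":
            break

        chunks = []
        def on_audio(indata, frames, time, status):
            chunks.append(indata.copy())

        # Record until enter is pressed again
        with sd.InputStream(samplerate=int(in_sample_rate), channels=1, callback=on_audio):
            input("Recording... press enter to stop: ")

        # Send the turn through the voice pipeline and collect the spoken reply
        result = await pipeline.run(AudioInput(buffer=np.concatenate(chunks)))
        reply_chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                reply_chunks.append(event.data)

        # Play the reply at OpenAI's 24 kHz output rate
        sd.play(np.concatenate(reply_chunks), samplerate=24_000)
        sd.wait()

# In the notebook this is run with a top-level await:
# await voice_chat()
```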
I'm obviously going to be talking more about the Agents SDK very soon as well, but for now I'll leave it there. I hope all of this has been useful and interesting. Thank you very much for watching, and I'll see you again in the next one. Bye.