Awesome. Well, hi everyone. My name is Dominic. I work on developer experience at OpenAI, and I'm excited to spend the next two hours talking to you all about voice agents. The QR code was already on the slides. If you just entered the room, you'll want to download the dependencies as soon as possible.
So head over to that QR code or to that starter repository and follow the instructions to install. That might take a while with the internet right now, so please do that as soon as possible. You have like 15 minutes of me rambling about stuff before we get started with actually coding.
So I said we're going to talk about voice agents. I want to first put everyone on the same page because I know we all have different definitions of agents and there's going to be a lot of definitions flying around at this conference naturally. So when we're going to talk about agents, we're talking about systems that are going to accomplish tasks independently on behalf of users.
And most importantly, they're essentially a combination of a model equipped with a set of instructions and access to tools that it can use to work toward that goal, all encapsulated in a runtime that manages the life cycle.
And that's an important definition because today we launched the OpenAI Agents SDK for TypeScript. If you've heard of the one for Python, today we basically released the TypeScript equivalent, and we're going to use that; it maps to those exact patterns. If you're unfamiliar with the Agents SDK, it's basically an SDK that provides you with an abstraction, based on the best practices we learned at OpenAI, for building agents.
And it comes with a couple of foundational features, including things like handoffs, guardrails, streaming input and output, tools, MCP support, and built-in tracing so you can actually see what your agents did and how they interacted with each other. In addition to those features coming from the Python SDK, the SDK we launched today in TypeScript also includes human-in-the-loop support with resumability, so that if you need to wait for human approval for a while, you can deal with that.
And most importantly, native voice agent support. What that means in practice is you can use those same primitives that we already have in the Agents SDK to build voice agents that handle handoffs, have output guardrails to make sure the agent is not saying things it's not supposed to, tool calling, context management -- meaning keeping track of the conversation history so you can use it in other applications -- and built-in tracing support so that you can actually replay conversations, listen to the user's audio, and properly debug what happened, plus native interruption support.
If you've tried to build interruptions, you might know how hard this is. If you haven't, be glad you don't have to. There's both WebRTC and WebSocket support, meaning it can run either on the server, for things like Twilio phone-call voice agents, or directly in the client in the browser.
That's what we're going to use today using WebRTC. But first, why would we be interested in voice agents in the first place? One of the things that I'm most excited about is it makes technology much more accessible to people. There's something magical about being able to talk to a voice agent and just have it -- kind of like see it do things.
It's also much more information dense. I can convey information much faster, but also it can contain a lot of information through the type of tone and voice that I'm using, the emotions. So it's much more information dense than sort of just basic text is. One of the cool things is also it can act as like an API to the real world.
You can have a voice agent go and like call a business for you and like have a conversation with them where maybe there isn't an API for that business. And so when we talk about building voice agents, there's essentially two types of architectures that have emerged when building these.
The first one is based on your traditional text-based agent and just sort of wrapping it into a chained approach where we have a speech-to-text model that is taking the audio and then turning it into text so that we can run our basic text-based agent on it. And then we take that and we run that agent, take the text output, and run it through a text-to-speech model to generate audio that we can play again.
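To make that concrete, a minimal sketch of one chained turn with the OpenAI Node SDK might look like the following -- the model names and the single-turn, file-based flow are assumptions for illustration, not a prescribed setup:

```typescript
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Rough sketch of one chained turn: speech-to-text -> text agent -> text-to-speech.
async function handleTurn(audioPath: string): Promise<Buffer> {
  // 1. Transcribe the user's audio (model choice is an assumption).
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: 'gpt-4o-transcribe',
  });

  // 2. Run your existing text-based agent on the transcript.
  const reply = await openai.responses.create({
    model: 'gpt-4.1',
    input: transcription.text,
  });

  // 3. Turn the text reply back into audio to play to the user.
  const speech = await openai.audio.speech.create({
    model: 'gpt-4o-mini-tts',
    voice: 'alloy',
    input: reply.output_text,
  });
  return Buffer.from(await speech.arrayBuffer());
}
```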
This has a lot of strengths. One of the most common reasons people reach for it is that it's much easier to get started with if you already have a text-based agent: you can take that, wrap some audio around it, and you have something you can interact with. The other aspect is that you have full access to any model.
Text is the main modality that any LLM has, and so you can use really any of the cutting-edge models. It also gives you much more control and visibility of what the model did by being able to actually look into the exact text that went in and out of the model.
But it also comes with some challenges. Turn detection is one of the big ones: you now need to take into consideration what the user actually heard by the time they interrupted the voice agent, translate that part back into text, and make sure your transcript is appropriately adapted so that the model doesn't think it told the user something it didn't.
Chaining all of these models together adds latency on every possible level, and so that's another big challenge. And then you're losing some of that audio context, right? You're transcribing the audio, and if you've ever tried to convey a complicated topic over a text, you know it's a bit harder than dealing with the same thing using your own voice.
So, an alternative to that chained approach is a speech-to-speech approach where we have a model that has been trained on audio, and then takes that audio to directly interact on the conversation and make tool calls, meaning there's no transcribing in the process. The model can just natively deal with that audio, and that translates into much lower latency because we're now skipping those speech-to-text, text-to-speech processes.
We can also now have much more contextual understanding of the audio, including things like tone and voice. And all of that leads to a much more natural, fluid level of conversation. But there are some challenges with this. One of the most common ones is reusing your existing capabilities. Everything is built around text, so if you already have some of those existing capabilities or a very specialized agent for a certain task, it's harder to reuse those.
Also, dealing with complex state and complex decision-making is a bit harder with these models, since they've been really focused on improving the audio conversational tone, less so on being very complex decision-makers. But there is a way around this. Again, taking inspiration from what we do with text-based agents, we can create a delegation approach using tools, where we have a frontline agent that is talking continuously to the user, and that agent uses tool calls to interact with much smarter reasoning models like o4-mini or o3.
Actually, let me, at this point, give you a quick demo and see how the internet goes here. So I have a real-time agent here that we built with the Agent SDK. It's going to be very similar to what you're going to build later on. But when I start talking to this, hello there.
Hello there. How is the AI engineer at World's Fair going? So we can now give it, like, give a task to, like, call tools that I gave it, like, "Hey, what's the weather today?" "Let me check the weather for you. One moment, please. Transferring you to the weather expert now.
Actually, I can directly help you with the weather information. Could you please specify the location you're interested in?" Oh, yeah. What's the weather in San Francisco? So you can see here it's actually dealing with the interruption. Enjoy the bright and pleasant day. Is there anything else I can assist you with?
No, that's it. Thanks. -- You're welcome. If you need anything else, feel free to ask. Have a great day. -- And so, in a similar way, we can actually trigger the more complicated back-end agents as well. I have a tool for this to handle refunds that will call out to o4-mini and evaluate the refund.
So hey there. I have one more thing. I recently ordered this skateboard that I tried to use, and it seems like I'm really bad at skateboarding. So I want to return it. It is slightly scratched, though. -- I'm here to assist, but it sounds like you need customer service for that.
I recommend contacting the company where you bought the skateboard. They can provide you with the return. Looks like I didn't add the tool. Maybe I did. Oh, I didn't ask for a refund. Let's try this once more. Hey there. I bought a skateboard recently that I tried, and apparently I'm really bad at using it.
So I wanted to return it. It is slightly scratched, so can you give me a refund? -- Hello there. How is the AI -- Hello there. How is the AI engineer -- The joys of the internet. Hey, I recently ordered a skateboard from you, and it failed. Like, I can't use it.
I'm struggling to use it. It's slightly scratched. Can you give me a refund, please? Hello there. How is the AI? I'm going to assess your request for a refund. There we go. Let's get started. It is slightly struggling with this, like, weird echo that we're having here. The skateboard arrived damaged, and you're eligible for a full refund.
We'll process that for you. All right. But you can see here that it was able to call that more advanced tool and actually process that request. And one of the nice things is that, like, while time to first token is often a really important thing, the longer a conversation goes, like, your model is always going to be faster than the audio that has to be read out.
And so this is a really helpful thing where, by the time the model was able to say, hey, I'm going to check on this for you, it had already completed that LLM call to the o4-mini model to get the response. All right. Let me -- oh, one more thing.
Since we talked about traces, one of the nice things now is we can actually go back here into our traces UI. With this launch today, you'll be able to look at any of your real-time API sessions, see all the audio that it dealt with and all the tool calls.
So we can actually see here that the tool call was triggered, what the input was, the output. We can listen to some of the audio again to understand what happened. And then, because both this and the back-end agent use the Agents SDK, we can go into the other agent as well -- the o4-mini one -- which we can see here.
And we can see that it received the context of the full past conversation, the full transcript, as well as additional information about the request, and then generated the response here. So this allows us to then get a full, complete picture of, like, what happened both in the front-end and the back-end.
Let's jump back into the slides and cover a couple of more things before we get coding. And that's about best practices. So I would group the best practices of, like, building a voice agent into three main things to keep in mind. The first one is to start with a small and clear goal.
This is super important because measuring the performance of a text-based agent -- you will hear a lot about evals at this conference -- is already hard enough. With voice agents, it's going to be even harder. So you want to be very focused on what the first problem you want to solve is, keep it focused on that, and give the agent a limited number of tools so that you're fully centered on it.
The Agents SDK makes this really easy because you can later on add additional tools and additional agents and deal with handoffs between them. But this way you can really stay focused and make sure that one of your use cases is great, and then hand other ones off to human agents, for example.
The second one is what I elaborated on, which is building evals and guardrails very early on so that you can feel both confident in what you're building but also confident in that it's actually working so that you can then continue to iterate on it and know when it's time for you to, like, grow the complexity of your voice agent.
As of today, you can use the traces dashboard for that. But alternatively, some of our customers, like Lemonade, have even built their own dashboards to really get an end-to-end idea of the customer experience, and then even replay some of these conversations with their agent as they're iterating on it.
The other thing that I'm personally super excited about with these models is that both our speech-to-speech model and our text-to-speech model are generative models, meaning you can prompt them the same way you can prompt an LLM around tone and voice, and you can give them emotions, roles, personality. We built this little microsite called openai.fm. It's a really fun website to play around with, where we have a lot of examples of different personalities and how that style of prompt can change what is being read out by our text-to-speech model. And so that's a great way for you to not limit the personality of your model just by the voice you picked, but also shape it through the prompt and instructions that you're giving it.
There was a question there. Would you mind using the mic that is right behind you, just so that it's on the recording? Hello, sir. So my question is about the previous slide on Lemonade. You're displaying how they have this dashboard where they can show all of this. Is this a dashboard that OpenAI provides and Lemonade just integrates as, like, an iframe or something?
No. So in this case, they built their own solution for it. Okay. And does OpenAI then provide all the JSON or the data structure that we can just plug into the... So the way the real-time API works under the hood is that you get all the audio data and you can do whatever you want with it, basically.
You're getting all the necessary audio events so you can use those data structures. So we're not storing them by default. You can use the Traces dashboard. We don't have an API for it yet, but you can use the Traces dashboard to get a basic look of that, but it's not iframeable.
But you mentioned it's only audio data. This shows not just audio, but also the transcription and all of that as well, right? So the Traces dashboard, if we go back to it, does show all of the transcripts and stuff as well, as long as you have transcription turned on, which I don't seem to have turned on for this particular one.
But it should, like, you can turn on transcription and you should be able to see the transcripts as well. Okay. Thank you. You're welcome. All right. Let's go back to this. The other part with it is, as I said, you can prompt both the personality. You can also be very descriptive with the conversation flows.
One of our colleagues found that giving it conversation states in a JSON structure like this is a great way to help the model think through what processes and what steps it should go through, the same way that you would give a human agent a script to operate on.
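For illustration only, a conversation-state structure embedded in the instructions might look something like this -- the shape and field names are an assumed example, not the official template from the slide:

```typescript
// Hypothetical conversation states woven into the agent's instructions.
const conversationStates = [
  { id: 'greeting', description: 'Greet the caller and ask what they need help with.' },
  { id: 'verify', description: 'Verify the account by asking for the name on the order.', next: 'resolve' },
  { id: 'resolve', description: 'Resolve the issue, or hand off to a human agent if you cannot.' },
];

const instructions = `
You are a customer support voice agent.
Follow these conversation states in order:
${JSON.stringify(conversationStates, null, 2)}
`;
```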
If you're struggling to write those scripts, though, we also have a custom GPT that you can use to help with that. And I'll share all of those links and a copy of the slide deck later on in the Slack channel, so if you're in there, you should be able to access them.
But with that, that's primarily what I wanted to talk through from a slides perspective. So from here on, what I want to do is build with you a voice agent. We'll see how that goes with the internet. Also, if you have headphones, now is a great time to bring them out.
It's going to be really weird when we're all talking to our own agent, but we're going to try it and see how it goes. So if you came in later, please scan the QR code, go to that GitHub repository, and set it up following the install instructions. There's no code in it yet other than a boilerplate Next.js app and a package.json that just installs the dependencies we need, so that we're not all trying to run npm install at the same time.
But what I want to do is build a first agent. So if you want, you can just straight up copy the code that is on here. But I'm going to actually go and type it along with you all so that you get a feeling for what's happening. And we have a good idea of timing.
So if you want to take a picture now, go ahead and do that. Otherwise, I'm going to switch over to my code editor and we're going to do this together. If you're running into trouble, the Slack is a great place to post technical questions.
And Anoop, who's over there, is going to try to help you. Alternatively, raise your hand, but it's a bit easier if you just Slack the messages there and we can multi-thread the problem. All right, let's go and build an agent. If you cloned the project, you should see an index.ts file.
Go and open that, and you should be able to import the Agent class from the @openai/agents package. That's what we're going to use to create the first agent. Yeah? Oh, yeah. Good call. Is that better? Cool. That seems a bit -- seems worse on my side than yours. But as long as you all can read it, I'll be fine.
All right. So what I want you to do is go and import an agent. And we're going to define our first agent. And as I mentioned, primarily, an agent has, like, a few centerpieces. The first one being a set of instructions on what to do. So we can give it instructions.
I'm going to say, you're a helpful assistant. It's sort of the most boilerplate thing you can do. We do need to also give it a name. And that's so that we can actually keep track of them in our traces dashboard. I'm going to say my agent here. This can be anything that helps you identify it.
And then we need to actually execute this agent. So we can import a run function here. And then we can await the run here. I'm going to run this agent with just hello, how are you? And then log out the results. And with the results, we get a lot of different information.
Because essentially, when we run an agent, it's going to do a lot of different tasks, from executing all necessary tool calls, if there are any, to validating output guardrails, etc. But one of the most common things that you just want to log out is the final output. That's whatever the last agent in an execution said.
So in this case, it's going to be a piece of text. Then you should be able to run it with npm run start 01, and that should execute it. You should see something like this, depending on what your model decides to generate. By default, this is going to run GPT-4.1 as the model.
But if you want to experiment with this, you can set the model property here. We can set it to o4-mini, for example, and then rerun the same thing. So this is the most basic agent that you can build. But one of the things that really makes something an agent is whether it can execute tools.
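If you're following along, the whole thing so far looks roughly like this (names are just examples; the model override is optional):

```typescript
import { Agent, run } from '@openai/agents';

// A minimal text agent: a name (used to identify it in the traces dashboard) plus instructions.
const agent = new Agent({
  name: 'My Agent',
  instructions: "You're a helpful assistant.",
  // model: 'o4-mini', // optional override; GPT-4.1 is used by default
});

const result = await run(agent, 'Hello, how are you?');
console.log(result.finalOutput); // whatever the last agent in the run said
```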
So we can import a tool here. There we go. And we can define a get weather tool. One of the things here is you have to specify the arguments the model should pass to it, and one of the ways you can do this is through a library called Zod.
If you've never heard of it, it's essentially a way to define schemas. And what we'll do is we'll both use that Zod schema to inform the model on what the parameters for this function call are. But we're also going to use it to validate then what are the actual arguments that the model tried to pass in and do they fit to that schema.
So we get full type safety here, if you're a TypeScript developer and you care about that. In this case, we have a get weather tool. We can give that tool to the agent, and we can change the prompt to "What is the weather in Tokyo?", which is what Cursor wants to autocomplete anyway.
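Roughly, the tool version looks like this -- the weather lookup is hard-coded, just like in the workshop repo:

```typescript
import { Agent, run, tool } from '@openai/agents';
import { z } from 'zod';

// The Zod schema both tells the model which parameters the tool takes
// and validates the arguments the model actually passes in.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

const agent = new Agent({
  name: 'My Agent',
  instructions: "You're a helpful assistant.",
  tools: [getWeather],
});

const result = await run(agent, 'What is the weather in Tokyo?');
console.log(result.finalOutput);
```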
So if I run this again, let me move this slightly. We can see it's going to take a bit longer now. And that's because it ran some tools. And now it's telling me the weather in Tokyo is sunny. And if you're wondering, well, did it actually run a tool?
We can go into our traces dashboard here and look at the trace. We have a My Agent here, and there we can see it ran, tried to call the tool, executed the tool, got "the weather in Tokyo is sunny" back, and then used that to generate the final response.
So the traces dashboard is a great way for you to see what actually happened behind the scenes. How are we feeling? Can I get a quick temperature check? Are people able to follow along? I see Zeke is giving a thumbs up there. So this is a text based agent.
I wanted to show you this just to get a bit familiar with the overall agents SDK so that we can jump into building voice agents. The first thing we need to understand about a voice agent is the slight differences between a voice agent and what we call a real-time agent.
Essentially, a real-time agent is just a specialized version of an agent configuration. There's just a few fields you can't pass in. But they can be used in what's called a real-time session. Because with voice agents, there's a lot more things to deal with than just executing tools in a loop.
One of the most important things is you need to deal with both the audio that's coming in, process that, and then run the model with that, and then deal with the audio that's coming out. But you also need to think about things like guardrails, handoffs, other lifecycle things. And so the real-time session is really dealing with all of that.
So let me show you how that works. For this, we're going to go into the same project. There's a 02 folder, and it has a page.tsx in there. This is a Next.js app that I really just gutted to have the bare minimum in there.
But this is a great way for us to just build both the front-end and the back-end part of the voice experience. Because this voice agent that we're going to build is going to run in the browser. In order to make sure we're not leaking your API credentials, one of the important things is you need to use an ephemeral key.
That is a key that is short-lived and is going to be generated by your server and handed off to your client so that they can use that to interact with the real-time API over a protocol called WebRTC. For that, you should see a token.ts file in your repository that just calls out to the real-time API to generate a session and then return a client secret, which is that ephemeral key that we can then use to authenticate with the SDK.
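A minimal version of that token.ts might look like the sketch below -- it hits the real-time sessions endpoint with your server-side API key and returns only the client secret; the exact model string is an assumption, so use whichever realtime model you're targeting:

```typescript
'use server';

// Generates a short-lived ephemeral key the browser can use over WebRTC.
export async function getToken(): Promise<string> {
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    // Model name is an assumption -- match it to the model you connect with.
    body: JSON.stringify({ model: 'gpt-4o-realtime-preview-2025-06-03' }),
  });
  const session = await response.json();
  return session.client_secret.value; // hand only this to the client, never the API key
}
```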
You do not have to do this if you're building a real-time agent that is running on your server. For example, in the case of Twilio app or something else where you can just directly interact with the OpenAI API key. But if you're running anything in the browser, then you actually need to generate this client key just so that you're not, you know, giving your API key to the world.
So with that in here, we can actually go and build our first real-time agent. So similar to previously, we're going to import an agent class here. But in this case, it's going to be a real-time agent. We're going to import it from the real-time package, which is just a sub path in the same package.
So you don't need to install a different package here. But now we can define a real-time agent that works the same way. We have a name. We give it instructions. Just sort of going with a default suggestion here. And now we actually need to connect that agent to a real-time session.
So I have a connect button here for running this example. Let me start it up here with npm run start 02. That command should be in your readme as well. It's going to start a development server. And we can go over here, reload this. And you can actually see it just has like a little connect button that right now doesn't do anything.
So let's connect that up. I don't need this anymore. Let me just move this to the side. So in this on connect function that gets triggered whenever we press the button, we want to deal with that connected state. So what we're going to do here is we're first going to fetch that token.
And what this code basically does is import that server action, which is a Next.js concept that just makes sure this code runs on your backend. If you're using a different framework, you should be able to just fetch this key from your backend server.
And then once we have that token, we can go and create a new real-time session. So what we're doing here is we're going to give it the first agent that should start the conversation up. I'm going to specify the latest model that we released today along with the agent's SDK.
This model is a if you've used the real-time API before, it's an improvement, especially around tool calling. It's much better on that front. We have a couple of different customer stories on our Twitter, if you want to check that out. And then I'm going to give it not there.
I don't know my cursor insists on that. The last step that we need to do is we need to connect to that session. So this is where we're going to give it that API key so that we can connect to the real-time session under the hood. Just so that it's easier for us to deal with all of this, I'm also going to close the session.
But I've got one thing here that is an oddity of React: we do not want to create a new session on every re-render. So I'm going to create what is called a ref here. Again, if you're new to React, this is basically just a variable that persists through re-renders.
So we need to slightly change this here. We're going to assign that to session.current so we can maintain that. And then that also allows us to say if there is a session.current set, we want to actually close that connection when we press the disconnect button. That just makes sure that we're disconnected from the audio again.
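Put together, the client side of what we just typed looks roughly like this -- a sketch, with the getToken import pointing at the server action above and the model string assumed:

```tsx
'use client';
import { useRef } from 'react';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';
import { getToken } from './token'; // the server action sketched earlier; path is an assumption

const agent = new RealtimeAgent({
  name: 'Assistant',
  instructions: 'You are a helpful assistant.',
});

export default function Page() {
  // A ref persists the session across re-renders instead of creating a new one each time.
  const session = useRef<RealtimeSession | null>(null);

  async function onConnect() {
    if (session.current) {
      // Already connected: close the connection and stop the audio.
      session.current.close();
      session.current = null;
      return;
    }
    const token = await getToken();
    session.current = new RealtimeSession(agent, {
      model: 'gpt-4o-realtime-preview-2025-06-03', // assumed model snapshot
    });
    await session.current.connect({ apiKey: token });
  }

  return <button onClick={onConnect}>Connect</button>;
}
```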
So I'm going to leave that on the screen for a second, and then we can test this out. But if you already typed this, go into your browser, refresh, press connect, and you should be able to talk to your agent. All right, let's try mine. Let me move this to the other side so it's not blocking your code.
Hello? Hi there. How can I assist these days? All right. So you can see it. It's just a few lines of code. We didn't have to deal with things like figuring out how to set up the microphone, how to set up the speakers. By default, if it's running in the browser, it will deal with all of that automatically.
If you do want to pass in your own microphone source or other things like that, you can do that as well. If this is running on a server, you have both a send audio function that allows you to send an audio buffer in, or you can listen to the audio event, which is going to emit all of the audio buffers that are coming back from the model so that you can pass it to whatever your source is.
So that's our first basic agent. Any questions so far? Please update the repo -- can you push that code to the repo? You want me to push the code to the repo? Can you push it? I can push it, yeah. Good call. Thank you. All right.
So now that we have that, let's go and actually give it a tool. This is really where the benefit of the Agents SDK comes in: we can use that same tool definition that we wrote earlier. So I'm just going to follow the autocomplete here. We should be able to just give that tool to our agent and save.
I need to import Zod again to do that schema validation. This is especially important on the real-time side because the real-time model currently does not support strict mode. So the JSON might not fully comply with your schema unless you're giving us a Zod schema and we'll go and validate that this actually fits that schema.
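Something like this, reusing the same tool() helper on the realtime side:

```typescript
import { tool } from '@openai/agents';
import { RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

// Same definition as on the text side; the Zod schema is what lets the SDK validate
// the arguments, since the realtime model doesn't support strict mode yet.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

const agent = new RealtimeAgent({
  name: 'Assistant',
  instructions: 'You are a helpful assistant.',
  tools: [getWeather],
});
```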
So that makes your code a bit easier. With that, we can go back. Hey, what's the weather in San Francisco? -- The weather in San Francisco is sunny today. -- We can disconnect it here. Also, this now deals with interruption. What's the weather in San Francisco? -- The weather in San Francisco is -- What's the weather in San Francisco? -- The weather in San Francisco is sunny today. Normally, that's enough to deal with the context. But it is super crucial to get that interruption timing right, so that your model doesn't think it read out, say, the full customer policy when the customer actually interrupted it halfway through.
All right, question: you don't have to manage all the events to actually do that anymore? No, the real-time session will handle all of those events. What we can do is listen to the transport event. Let's do this: this will log out all of the events that are happening under the hood.
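That listener is a one-liner on the session -- a sketch, assuming the session ref from before:

```typescript
// Dump every raw realtime API event the session is handling under the hood.
session.current?.on('transport_event', (event) => {
  console.log(event);
});
```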
So if we open the dev tools here and rerun this. Hey. Hey there. How can I help you today? So you can see all of the events that normally you would have to deal with are being dealt with. You still have full access to them. So you can both read them, but you can also send events yourself.
So it's going to handle all of those things, but continue to pass them on to you if you want to do your own logic on top. I'm going to push that code for you so you can pull it. Cool. All right. Since we already have this commented-out code, the other thing that is typically requested is showing the transcript.
I want to see what sort of is being transcribed. And the important thing here is I'm using the word transcribe because even though the speech-to-speech model is dealing with the audio directly and there is no transcription step in between, by default, we're going to transcribe all of the conversation at the same time.
You can turn it off if you want to. If you're using the API directly, you have to actually turn it on. In the agent SDK, it's turned on by default because it's such a common request. And it enables us to do a couple of additional features that we'll cover later on.
But this is going to give us the whole history every time. So I'm just going to take that history -- or rather, there we go, import that -- and set it as a state variable. And then, because it's React, we can create a list here, mapping over all of it. I need to filter, because the history has both tool calls and messages, and I only want to show the messages for this. So I should be able to -- why does it want that? Let's see.
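For reference, the relevant pieces look roughly like this (fragments of the page component; the item shape is simplified, so treat the filter and rendering as an approximation):

```tsx
// In the component (remember to import useState from 'react'):
const [history, setHistory] = useState<any[]>([]);

// Inside onConnect, right after creating the session:
session.current.on('history_updated', (updatedHistory) => {
  setHistory([...updatedHistory]); // our local copy of the conversation so far
});

// In the JSX: render only the message items (the history also contains tool calls).
<ul>
  {history
    .filter((item) => item.type === 'message')
    .map((item, index) => (
      <li key={index}>{JSON.stringify(item.content)}</li>
    ))}
</ul>
```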
Close this. Refresh. Hey. -- Hello. How can I assist you today? -- How's the weather today in San Francisco? -- The weather in San Francisco today is sunny. Anything else you'd like to know? -- So you're automatically getting that conversation. If you interrupt the model, one of the things that happens is that the transcript disappears, and that's because the model currently does not adjust that transcript.
And instead, it's going to be removed. And we're going to remove it from that object as well, just so that you get the most accurate representation and you're not thinking that, like, the model read out a certain piece of text. And again, with everything that we're doing here, we can actually go back into traces.
And we can see that same representation here with the weather call and everything. So, again, it helps with the debugging. I'm going to go briefly back to the slides. So we covered - we set up our first agent. Yeah. The question was, how do you store the conversation history?
So it's currently fully stored in memory. So basically, there is going to be a bunch of events that I logged out that are being emitted by the real-time API. All of those are going to be sent over to the client and then stored in memory in a conversation, like, in just, like, an array, essentially.
So you can do whatever you want with that by listening to that history updated event. If you do want to store it somewhere, you can store it. The other part is that the traces are automatically stored on the OpenAI platform, as long as you have tracing enabled -- it's enabled by default in the Agents SDK, but you can disable it.
And then the other aspect of that is, if you are a ZDR customer, so a zero-data-retention customer of OpenAI, you don't have access to that traces feature. The question was, how much of the voice context, how much of the previous conversation, is being used? That's dealt with directly by the real-time API.
So, like, the real-time API, when you start that session, that holds the source of truth for that whole conversation session. So what you're receiving on the client side is just a copy of whatever is happening at that point. It's not the source of truth of what we're going to adapt to pass into the model.
The question is, how does it work with the inference cost, and whether you're passing in that whole conversation? Anoop is nodding; he is the bigger expert there. But yes, we're keeping track of the usage. There's an event that you can log out to see your token cost.
So you have an idea of, like, what is being actually passed in. So, like, with every, if we're going back here to this example, you can see these response done events. I don't know, where is the, shouldn't it be on the response done? It is being sent over. I just do not know right now why it's not showing.
Oh, there. So you can see here, it outputs the detailed information of your token usage at any point in time. So while you don't have, like, access to, like, what is exactly what was passed into the next response generation, you can keep track of the cost as it's happening.
You're welcome. Yes. There's a microphone right over there. That might be easier than you yelling across the stage to me. I see that the format that you're using is PCM 16. Is there a way in which we can modify the output formats of the audio files so we can save in memory?
Yeah, there are different audio formats that you can use, including, for example, u-law, which is helpful for phone calls. Another question on the usage: does that final assistant response roll up all the tokens from all the intermediate tool calls as well?
Does that make sense? Like, the agent needs to, like, kind of reason through and then format tool calls. So I'm assuming it's not just the output tokens for only the assistant response, right? It, like, every tool call is a response in general as well. So, like, it works the same way that, like, the responses API works, for example.
Okay. So, because we're using this and we have tool calls and tool call outputs, right? And I couldn't find the usage attribute on the tool call output. Is it somewhere in those raw events that are outputted? Do you know, Anoop? Okay, no worries.
I know it's, like, kind of early on. We can follow up. All right. Thank you. You're welcome. Yeah, do you want to head over to that microphone that is right behind you? Yeah. It just makes it a bit easier. Oh, yeah. There's, in the meantime, if you want to...
Just a quick question. Can I go back to the slides explaining the different modes of the audio agents, like, the text in, text out, that's the first one. Oh, yeah. Text to speech, that's the second one. Oh, yeah. I didn't get the third one, and... Oh, you mean this?
Yes. Yes. Yes, it's... When you just showed us the GPT-4 real-time, that one... Yeah. Is that... This PPT... This slide is about... Yeah, exactly. So, like, where it's, like... When we did the refund, it kind of followed this pattern, where it performs a tool call... Like, the, like, real-time API agent can perform tool calls.
It performed a tool call to trigger a separate agent -- the refund agent -- that, in my case, used o4-mini to execute that task and then hand it back. Okay. Got it. Thanks. You're welcome. Yes? I'm currently using a regular OpenAI agent. So what are the challenges we'd face when we want to change regular agents into real-time agents?
So, there's a couple of different challenges. Like, one is, like, anything that you're doing around latency... Like, anything you're doing around voice, latency is always king. So, like, you want to figure out what are the best ways to... I actually have a slide around this. Like, when it comes to things like tool calling, you want to find ways to do things like buying yourself some time.
So, you will typically see some prompting around, like, announce what you're about to do next before you're doing it. And that's to do that little trick around while the previous audio is still being read out. The agent can already perform the tool call and wait for the tool call to come back.
Because, similar to a text-based agent, the model can't receive additional data mid-response -- we can interrupt the response, but it can't then finish that same response, if that makes sense. And so you want to do these sorts of time-buying tricks.
The other thing is, if you're building a real-time agent, the longer your prompt gets, the more it increases the likelihood that the model gets confused. So you want to make sure you're properly scoping those use cases through what we call handoffs, where you have different agents that are more scoped to specific steps in your experience.
Thank you. You're welcome. Yes. Can you speak a little bit more about memory? Earlier, you said that... Is that short-term, long-term, such a... Yeah, so you're... Yeah, so the question is about memory. Basically, right now, the... Let me correct this. When we go back to this demo, what you're seeing here is essentially just, like, a copy of the events that we're receiving back.
So this is helpful as a visualization of the history. That being said, the actual memory, in the sense of an LLM agent's memory, is the session context on the real-time API side. There are events that you can use to update that. We actually have an update history event where you can pass in what you want the history to be.
But what that does is essentially, like, fire off events to the real-time API to say, like, delete this item from the history or add this new item. And you can give it a previous item ID. So, like, you can, for example, like, slot messages into a specific spot if you wanted to.
Does that make sense? But there's, like, no, like, advanced, like, long-term memory solution like you were alluding to. Cool? Yes. Hi. Do you have tips for handling input from low-fluency users? Like, say someone who's just learning a language and they have, like, multilingual input and maybe broken grammar and their pronunciation is not so good?
I don't think I have any, like, best practices right now that I could share. Can it handle it just off the shelf? It can handle, like, switching languages and things like that. Okay. But it might not be able to handle low fluency. I don't know if we have any use cases.
Yeah, we have some customers that are, like, language learning companies. So, there is some that are using it that way. But I don't think I have any, like, best practices that I can share. Okay, thank you. You're welcome. Sorry. Back in the code, is there a callback for the interrupt?
And does it include the last transcription? Um, there is a call for the interrupt, but, uh, there is no, um, there's no actual, like, event that -- There's no param or e that comes with it or anything like that? No, there's currently no, um, transcript. So, what you can do is, if you're getting this, you can call -- Get history or something?
Uh, there's just history. Okay. So, like, this always is up to date. Cool. Um, so you can -- you have access in that moment. Okay. The thing that we do have is, um, for tool calls specifically, um, you're getting some additional context, and that context has a, um, history parameter that you can, like, push into.
Okay. Um, it's more documented in the -- in the documentation. In the API. Okay, great. Thank you. You're welcome. Awesome. Um, let's move a bit on and show a couple of other things. So, we talked about tools. As I said, like, one of the benefits is you can reuse the same syntax that you're doing with text-based ones.
It's also a good way for you to communicate with your back-end systems over HTTP. Follow the general practice of keeping your tool calls as low latency as possible. For example, if you know a task is going to take longer, start the task, give it a task ID, and give the agent a tool to check on the status -- something like the sketch below.
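Here's one hypothetical way to sketch that pattern: a start tool that returns a task ID right away, plus a status tool the agent can poll. The in-memory store stands in for your real backend.

```typescript
import { tool } from '@openai/agents';
import { z } from 'zod';

// Stand-in for your real backend state.
const tasks = new Map<string, { done: boolean; result?: string }>();

const startTask = tool({
  name: 'start_task',
  description: 'Start a long-running job and immediately return its task ID.',
  parameters: z.object({ kind: z.string() }),
  execute: async ({ kind }) => {
    const taskId = crypto.randomUUID();
    tasks.set(taskId, { done: false });
    // Simulate the work finishing later; in reality your backend does this.
    setTimeout(() => tasks.set(taskId, { done: true, result: `${kind} completed` }), 5000);
    return `Started. Task ID: ${taskId}`; // returns quickly, so the model isn't stuck waiting
  },
});

const checkTask = tool({
  name: 'check_task',
  description: 'Check whether a previously started task has finished.',
  parameters: z.object({ taskId: z.string() }),
  execute: async ({ taskId }) => {
    const task = tasks.get(taskId);
    if (!task) return 'Unknown task ID.';
    return task.done ? `Done: ${task.result}` : 'Still running.';
  },
});
```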
That helps with getting back to it because, again, while the tool call is going on, the model is sort of stuck. So you want to make sure to get back to it as soon as possible. One of the other things that you can do is human approval.
I can show you that quickly. There's essentially a needsApproval option that you can either specify as a function that will be evaluated before the tool ever gets triggered -- a great way if you have more complex logic around "do I need approval for this" -- or you can set it straight up to "I always need approval," at which point there is another event, tool approval requested, and in that event handler we can do things like a good old prompt.
And then we can go and approve that tool call again. I don't know why the autocomplete is not working -- approve, there we go. And why is it -- this is where I go into the docs, because I do not remember why this is autocompleting the wrong way. But everything I'm showing you is in the docs. So we can just -- oh, it took the wrong thing. The first one -- there we go. Approval request. Thank you. It's the classic thing where you're on stage and can't quite get it -- there we go. So, in this case, I'm just going to always approve.
But if we now go in: "Hey, can you tell me the weather in Seattle?" We can, in that case, approve it. It's always going to approve right now because I'm not actually checking the status. But that means you can build a human-in-the-loop approval experience.
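A rough sketch of that flow: the needsApproval flag on the tool plus a handler for the approval event on the session. The exact event payload shape here is an assumption, so check the SDK docs for your version.

```typescript
import { tool } from '@openai/agents';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a given city.',
  parameters: z.object({ city: z.string() }),
  // true = always ask; you can also pass an async function for more complex logic.
  needsApproval: true,
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

// Intercept the approval request before the tool ever runs.
// The payload shape below is assumed -- adjust to what your SDK version actually emits.
session.current?.on('tool_approval_requested', (_context, _agent, request) => {
  const ok = window.confirm('Approve this tool call?'); // good old browser prompt
  if (ok) {
    session.current?.approve(request.approvalItem);
  } else {
    session.current?.reject(request.approvalItem);
  }
});
```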
This is really convenient, especially if you're running it in the browser and you just want to have, like, a confirmation of, like, the tool is hallucinating things before the customer actually submits it. And does it do it directly? Can it actually say, "Are you okay if I do this?" So basically, this is happening -- so the question is, does it automatically do this?
Like, the -- what we're doing and the reason why this is separate is the model is asking for this tool to be executed. But we're intercepting this, basically, before we're ever generating or executing the response. This is intentional so that, like, you don't have to deal with -- like, we want you to think through why should this tool need approval as opposed to doing that somewhere halfway through your tool execution.
And you have to, like, deal with the consequence of rolling back every decision that you've made, for example. And so, by default, if this is just needs true, it cannot get past that until the execution was approved, at which point it stores it in the context that is stored locally and then bypasses that security.
So this is not happening on the model level. I'm going to remove that again. The other thing we talked about already, but I want to show in practice, is handoffs. A handoff is essentially just a specialized tool call that resets the configuration of the agent in the session.
So that we can update the system instructions, we can update the tools, and make sure that we can nicely scope the tasks of what we're trying to solve. So what you cannot do -- I know people are probably going to ask about this -- is you can't change the voice of the agent mid-session.
You could define different voices on different agents, but the moment the first agent starts talking, that's the voice we're going to stick with throughout the entire conversation. So that's a caveat to keep in mind. But handoffs are still very helpful, so let's say we have a weather agent here.
We'll do this. And then we can actually give it a handoff description. So if you don't want to have this in your system prompt, but you just want to help the model understand when to use this, you can say something like: this agent is an expert in weather.
And then this one is going to have that weather tool. We're going to remove it from this one, and we're going to give it a handoff instead to that other weather agent. So now if I'm going to restart this. Hey, can you tell me the weather in New York?
The weather in New York is sunny, so you might want to grab your sunglasses if you're heading outside. Enjoy the day. All right, that's the model's best attempt at a New York accent. We'll take it. But you can see there that it automatically handed off from that first agent to the second one and let it handle it.
You can, through prompting, do things like, do you want it to announce that it's about to handoff? Do you not want to do that? Sometimes it's a bit awkward if you're forcing it to always do it. So, like, I would not necessarily try it, but maybe that's the type of experience that you want to have.
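In code, the handoff setup we just typed looks roughly like this (instructions abridged):

```typescript
import { tool } from '@openai/agents';
import { RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

// The specialized agent owns the weather tool.
const weatherAgent = new RealtimeAgent({
  name: 'Weather Agent',
  handoffDescription: 'This agent is an expert in weather.',
  instructions: 'Talk with a New York accent.',
  tools: [getWeather],
});

// The frontline agent no longer has the tool; it hands off instead.
const agent = new RealtimeAgent({
  name: 'Assistant',
  instructions: 'You are a helpful assistant.',
  handoffs: [weatherAgent],
});
```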
So that's handoffs. Let me do you a favor and push that code. So the agent can't change the voice when handing off to another one, but it can change accents? Yeah, so that's a good question. The agent can't change the voice, but it can change the accent.
Again, this goes back to the model being a generative model. So you can prompt it to have different pronunciations, tonality, voice in that sense, but it cannot change the voice model that is actually being used to generate the output. So maybe as an extension of that: is it the whole real-time request body that can't be changed, or just the voice?
So can I create a tool that could adjust the speed if someone said it's talking too fast, or the noise reduction? You should be able to change it. I have not tried changing the speed parameter mid-session because it literally came out today. I don't know -- Anoop, have you tried this?
No. But, like, you can, like, essentially a handoff does change the session configuration. Like, if we look back at, like, the, like, one of the transcripts here, like, now that we have a handoff. Let's go to this trace. So you can see here that, like, it called the transfer to weather agent.
But then these instructions were "talk with a New York accent." So in this case, it did change the instructions midway through the session. And in the same way, when that handoff happens, we take away the tools and give it new tools. So you can change those tools -- you could have a tool to change the tools -- but my recommendation would be to use a different agent for that.
But then the speed control -- you should be able to send off an event for that, but I have not tried it. Yeah, or maybe the background: basically, if someone was in a noisy environment -- hey, you seem to be getting interrupted -- could you adjust what we catch in the background and start adjusting that parameter so the voice is protected as far as possible? Yeah. So the question is, for example, if someone is in a noisy environment, could you have the agent detect that and then adjust some of the session configuration to deal with it, so the voice is protected? I don't know, honestly, which parameters are protected or not. The good thing is the API will throw an error if it didn't work, so it's a good thing to experiment with. -- You could do that in Python. -- Hmm? -- With the previous ones, you could do that in Python.
Oh, yeah. Well, with the Python Agents SDK, we're doing the chained approach. We don't have a real-time built-in yet. So... Just calling the old API in Python, you could change it. Oh, all right. Yeah, yeah. Then it should work. Yeah. If you can do it in Python, like, it should just work.
Cool. So the other thing we talked about is this delegation part. So that's what I, like, had mentioned earlier that was in the diagram. So this is essentially where you want to be able to have certain complex tasks dealt with by a more intelligent model. And the way we can do that is essentially just creating another agent except on the back end.
And because the TypeScript SDK works both in the front and back end, we can do that through -- I think I have a -- let's see if we have a file here or not. We can do that using the same SDK. So I'm going to create on the -- in the server folder here, a new file I'm going to call just agent.
And in here, we can build our regular text-based agent. This is essentially the same code we wrote before. We can say -- let's call this one the Riddler: you are excellent at creating riddles based on a target demographic and topic. And we'll just give it a model of o4-mini.
Also, a reminder for those, if you are trying to follow along and you run into troubles, post in the Slack, and Anoop can help you with that. So we have that new agent here. We're not going to give it any tools or anything. And then we can export a function here that we just call a run agent.
And this is just going to take some input and then return that output. And we can go back into our front-end code, create a new tool here, create riddle. And this one, we're just going to have -- take like two parameters, the demographic and the topic, and then call out the run agent function that is going to run on the server.
We can pass in -- actually, I realize I didn't specify this. Let's do demographic and topic, and then create an input here from those. The other thing you want to do when you're using server actions in Next.js is put that 'use server' directive at the top. That makes sure this file executes on the server.
And then we can pass in that demographic. And again, if you're using a different framework, this is just the equivalent of a fancy fetch request. So if you want to do an HTTP request to your server, if you want to maintain a WebSocket connection to your own backend, you can do all of those things to talk back to other systems.
So with that, we can give that to our main agent. And then what you want to do in these cases is -- you can tell it like announce when you are about to do a task. Don't say you are calling a tool. Things like that can be helpful to like buy itself some time.
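The two halves of that delegation look roughly like this. First, the backend agent behind a server action (file names mirror the workshop, but treat the details as a sketch):

```typescript
// server/agent.ts
'use server';
import { Agent, run } from '@openai/agents';

const riddler = new Agent({
  name: 'Riddler',
  instructions: 'You are excellent at creating riddles based on a target demographic and topic.',
  model: 'o4-mini', // the heavier reasoning model handles the actual task
});

export async function runAgent(input: string) {
  const result = await run(riddler, input);
  return result.finalOutput;
}
```

And then the frontline realtime agent gets a tool that simply calls that server action:

```typescript
import { tool } from '@openai/agents';
import { z } from 'zod';
import { runAgent } from './server/agent'; // path is an assumption

const createRiddle = tool({
  name: 'create_riddle',
  description: 'Create a riddle for a given demographic and topic.',
  parameters: z.object({ demographic: z.string(), topic: z.string() }),
  execute: async ({ demographic, topic }) =>
    (await runAgent(`Create a riddle for ${demographic} about the topic: ${topic}.`)) ??
    'No riddle generated.',
});
```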
Hey there. Can you tell me a riddle for like a five-year-old Star Wars fan? Hey are you there? I'm still here. I'm working on creating that Star Wars riddle for you. It should be ready in just a moment. Here's a riddle for your little Star Wars fan. I'm not -- So you can see that like because it announced that like what it's about to do, the tool call came back before it even finished what it previously said.
And so that's again one of the benefits: if you can get your agent to balance that out and buy itself some time, it's a good way to deal with more complex tasks. It also means that you can, for example, take all of the more reasoning-heavy workloads out of the voice agent model.
For delegation, is it possible to delegate to more than one agent simultaneously, or is it just one in the current SDK? You can -- it's a tool call. So I think you would have two options, right? You could do parallel tool calling, or you could have one tool that then triggers running multiple agents.
My recommendation would potentially be the latter, so that you're not relying on the model making the right decision to call multiple tools at the same time. You want to make the decision-making for the voice agent as easy as possible. All right, thanks.
Yep. Previously when you did -- you go back to the previous page, the -- Yeah. Oh, no. Sorry, your example. Which example? The one you're running where you had the output. Oh, here. So on line three there where it said I'm still here. Yeah. Is that coming from your SDK?
Do I have to use the SDK to do that? It's the real-time API. No, it's the real-time API. It responded because I asked like, hey, are you there? So it -- it -- it realized that like it didn't start anything and like interrupt and it was like, hey. There's a tool call going on.
Yeah. So the thing is that I didn't render out the tool calls here, right? So what basically happened between this message and this one is that it started off a tool call, and then, because I interrupted it, it stopped that tool call and stopped the generation. It also reset the transcript here, which is a good indicator that the interruption happened. And so when I said that, it remembered it was trying to call a tool, did that tool call, and then gave you back the response. So that's all just the regular real-time API. Any other questions around this?
Yeah? What's the cost per minute? We charge per token. There are some rough translations -- I don't know, Anoop, do you have them? -- all right, so it's more expensive than TTS and speech-to-text chained up with a model in most cases, but it depends on the use case and the model choices. So it's a bit harder to say what the per-minute pricing is because, again, it's by token, and it also depends on whether you have transcription turned on, how many function calls you have, and things like that, because it's a mix of audio and text tokens.
One of the interesting things -- and this is not a thing in the regular API, this is an Agents SDK-specific feature -- is guardrails. The Agents SDK, both in Python and TypeScript, has this concept of guardrails that can protect either your input or your output, to make sure the agent is not being meddled with or doing things that are against policy. We took that same pattern and moved it over to the real-time side, where essentially we run these guardrails that you define in parallel, on top of the transcription, at all times. You can specify -- you can see it at the bottom here -- how often you want to run them, or whether you only want to run them once the full transcript is available.
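Sketched in code, a realtime output guardrail looks something like this -- the "Dom" check mirrors the demo coming up, and the option names follow the SDK's output-guardrail shape, but double-check them against the docs:

```typescript
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Assistant',
  instructions: 'You are a helpful assistant.',
});

const session = new RealtimeSession(agent, {
  outputGuardrails: [
    {
      name: 'No mention of Dom',
      // Runs client-side against the transcript as it streams in.
      async execute({ agentOutput }: { agentOutput: string }) {
        const mentionsDom = agentOutput.toLowerCase().includes('dom');
        return {
          tripwireTriggered: mentionsDom,
          // outputInfo is handed back to the model as a hint about why it was interrupted.
          outputInfo: { policy: 'Never call the user Dom.' },
        };
      },
    },
  ],
  // How much transcript text to buffer before running the guardrails (option name assumed; tune as needed).
  outputGuardrailSettings: { debounceTextLength: 100 },
});
```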
This is a great way to make sure the model doesn't violate certain policies. You want these to run as efficiently as possible because they're still running in the client, but it's a good way to stick to certain policies, and if the output violates them, the session will interrupt it. Now, there is a bit of a caveat: because we're running this on a transcript, there's a timing aspect. If the output would violate your guardrail in the first couple of words, chances are it will say those first couple of words and then get interrupted. If it happens at a later point in time, the text output -- it's not really a transcript -- will be done before the audio is done being read out, and in that case the model will just correct itself. To give you an example: this is a guardrail that just checks whether the word "Dom" is in the output. If I ask it, "hey, please call me Dom," chances are it will call me Dom and then self-correct. If I tell it to tell me a story and only introduce Dom in the second act, it will catch that at a much earlier point, because the transcript will be done before the audio is read out to the user. So the user will never hear "Dom."
"Okay, I'm sorry, I couldn't help you with that. Let's do something else instead." And you can give it policy hints, essentially, on why it violated the policy or what it should do instead -- you can give it this output info where you inform the model about why this happened.
No -- so the question was whether the transcript is still done with Whisper. You can switch the transcription models. We released two models in March: one is gpt-4o-mini-transcribe and -- right, yeah -- gpt-4o-transcribe. I was trying to remember: we have only one text-to-speech model, but two transcribe models.
Awesome. That was the main part of what I wanted to walk you all through. I'm going to post all of the links and the slides in the Slack channel, which -- let me go back to -- where was that slide?
There we go. So in that Slack I'm going to post all of the resources, if you want to check them out afterwards. I've also already put a bunch of the resources I talked about at the bottom of that starter repository, so you should have access there as well. And I'm happy to hang around and answer any questions. Yes? What about prompt caching for the voice side?
My understanding is that for text, with prompt caching, the request gets routed to the same system. Yeah, the question was around how prompt caching works -- whether for prompt caching we guide the requests to the same system again, and whether there's any control over that with real-time, because latency obviously matters. I don't think there are any controls for that -- no, I'm getting a no from over there, so there are no controls around that right now.
Yes? Hey, so we all know that having natural conversations involves more than just spoken words. It involves detecting emotion and adjusting tone; it also involves cadence, and even humming to let the other person know that you are listening. I wonder if the current speech-to-speech model is capable of having that kind of natural conversation.
Part of this is a prompting challenge. It definitely can have pretty natural-sounding conversations, and this is the part where I highly recommend checking out the openai.fm page, because it's pretty interesting to see. So if we go to -- let's see if we find the -- actually, I didn't show one neat feature that I normally call out: the playground. If you're just getting started with real-time and don't even want to write a line of code, this is a great way to just have conversations and try things out. One of the things it has is a couple of example system prompts, and one of my favorite ones to show is the bored teenager. So if we start this: "Hey there, I'm at AI Engineer World's Fair and everyone is super stoked about voice agents. Can you show me some excitement for this whole thing launching today?"
Let's see. "I guess it's cool. Whatever. There's always new stuff launching. People get excited, but, you know, it's just voice agents. Not really my thing to get all hyped up about." So you can see in this case it put its own pauses in there -- that wasn't a pause because the model was waiting. It can deal with a lot of that sort of adjusting of tone and voice, and it can do similar things like reacting to someone talking over it. Okay, thanks.
You're welcome. Yes? So again, on the new API that's been released: what else has changed or improved? Is there anything else that's improved in terms of turn detection or voice? We have not released any new VAD models. We primarily released a new model -- a new gpt-4o realtime model -- that is just better at function calling and has been overall well received by our alpha testers. Yes? Can you inject audio as background audio -- like ambient audio, typing sounds, and so on?
Basically so it sounds like you're in a real office? Right, yeah -- you can just intercept the audio that is coming back from the model and then overlay your own audio. Yeah, hi, I was wondering what your support is for multiple speakers. If there's more than one person in a conversation, can it detect who's talking and do pauses that way?
There's no current speaker detection in the model, so it might struggle with that. Yeah. I just wanted to ask about custom voices: is it limited to the preset voices you have, or can I upload my own voice as a sample, for instance? It's currently limited to the voices that we have -- we keep adding new voices, though. Is there going to be support for adding custom voices anytime in the future?
At any time in the future? Not -- yeah -- what I can say is that we're trying to make sure we find the safest approach. We have an article online that talks about the responsible approach we're trying to take on this: making sure that if we provide custom voices, it comes with the right guardrails in place to avoid abuse. All right.
Thank you. You're welcome. Yes? Is there still a 30-minute session limit? And if so, what is the recommendation? I don't think that has changed, to my knowledge. My personal recommendation -- and this goes back to, for example, the demo I was showing -- is that if you're keeping track of the transcript, then when you're starting a new session you can populate that context by creating new items using the API. So one of the things you could do when starting a new session, if you know what the previous context was because you kept track of it, is basically inject that as additional context. What type of event are you looking for?
Oh, whether there's a timeout event? I do not know right now, but all of our events are documented in the API reference. Yeah, so when you say the Realtime API can execute function calls, does that really include things like system file reading and writing? Sorry, can you repeat that question one more time?
Oh, I was wondering if the Realtime API can use function calls for things such as system file writing or reading. Ah, whether the function calls can do things like system file reading. I would say it depends on where you're running that real-time session. If you're running it on the server, you can do anything you can do on the server. If it's running in the browser, then you're limited to whatever is available in the browser. No, you should be able to -- you could create a WebSocket-based voice agent that runs on your device.
I mean, it's going to use the Realtime API for the model, but because the actual tool calls get executed on your system, you should have access to whatever aspects of your system your program has access to. Cool. Yes? So even before we get to voice agents, we all need bigger, better, and more diverse evaluation sets, especially for anything around function calling and parameterization. Do you have any best practices or suggestions for how we now take evaluation into the voice world?
Should we keep things in text and then, you know, use text-to-speech to have voice versions of it? Just if you have any suggestions for how we evaluate the full range of inputs that we would expect users to bring to this. I mean, one of my suggestions would be: if you can, go to the leadership track and catch Anoop's talk tomorrow.
He's going to talk a lot more about additional best practices from what we've learned building voice agents. I would say: if you can hold on to the audio, that's helpful. Transcriptions too, definitely, but the audio is still the most powerful thing, especially for speech-to-speech models, where the model acts on the speech, not on the text. And this is one of the few places where the chained approach makes some of this much more approachable: if your agent is running on text anyway, you can just store the text and rerun it, and that makes that part of evals a bit easier. That makes sense. And then, for those of us who might be thinking about launching a new voice agent, how would you suggest evaluating it before we get to the stage where we have customer interactions to work with?
Some of this goes back to the fact that human review is an excellent solution here. One of the big things with companies like Lemonade is that they're able to go through all of these calls and get an idea, but they also have their own predetermined set of examples that they want to test as they're developing the agent. So that's a great first step. I would also clearly scope the problem you're trying to solve: if you're trying to boil the ocean, it makes a lot of this significantly harder, as opposed to well-scoping what the agent should and shouldn't be able to do. Makes sense.
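On the earlier suggestion of keeping eval cases in text and generating voice versions of them: here is a minimal sketch of doing that with the text-to-speech endpoint in the openai Node SDK. The test cases, IDs, and file names are made up for illustration; the model and voice names are the ones mentioned in the session, but treat the exact choices as assumptions for your own setup.

```ts
import fs from 'node:fs/promises';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical text eval cases you already maintain for a text/chained agent.
const cases = [
  { id: 'case-001', text: 'I would like to move my appointment to Friday afternoon.' },
  { id: 'case-002', text: 'Cancel my subscription but keep my account active.' },
];

for (const c of cases) {
  // Synthesize a spoken version of each case so the voice agent can be exercised with audio input.
  const speech = await openai.audio.speech.create({
    model: 'gpt-4o-mini-tts',
    voice: 'alloy',
    input: c.text,
    response_format: 'wav',
  });
  await fs.writeFile(`${c.id}.wav`, Buffer.from(await speech.arrayBuffer()));
}
```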
Thank you. You're welcome. Going down the same path as our friend over there: actually, when I'm testing my text agents -- conversational agents -- I use Promptfoo. It's a platform for doing all the proper testing. Could I put another agent in to talk with this agent and do all the evaluation? I think you can try it -- I don't see why it shouldn't work. So I'd put another voice agent talking to that agent to try to execute all the prompts, and then I could get the transcription? Yeah, I mean, I know we have use cases where customers also use our models to prompt humans, for example for training use cases and other things. So it should work, but I don't know if anyone uses that kind of approach. Lemonade does. Oh, cool.
So the second picture is exactly that. Awesome, thank you. You're welcome. Any other questions? Yeah, go ahead. Slightly related: do you have something around wake word detection on the Realtime API roadmap, or patterns for wake words? Wake words -- no, we don't have any wake words built in or anything. No patterns either, like any patterns to avoid costs? No. On device? No.
No You could basically like what you can do is you can build your like you can turn off our voice activity detection and then build your own um, and then basically use that so like you could use a model that has like a like a vad voice activity detection model that has wake words in it and then like Do it that way and basically commit all of that audio to our api and then send like a commit like a commit event cool Awesome.
Thank you so much for taking the time and spending the afternoon with me. Thank you. We'll see you next time.