Building voice agents with OpenAI — Dominik Kundel, OpenAI

Chapters
0:00 Timestamps
0:16 Introduction to voice agents.
1:28 Overview of the OpenAI Agents SDK for TypeScript.
3:27 The case for why voice agents are important.
4:21 A look at different architectures for voice agents.
60:16 Best practices for building voice agents.
77:31 A hands-on guide to building a voice agent.
00:00:36.880 |
So head over to that QR code or to that starter repository 00:00:41.920 |
That might take a while with the internet right now, 00:00:46.220 |
You have like 15 minutes of me rambling about stuff 00:00:51.560 |
So I said we're going to talk about voice agents. 00:00:55.320 |
I want to first put everyone on the same page 00:00:57.160 |
because I know we all have different definitions of agents 00:01:00.840 |
and there's going to be a lot of definitions flying around 00:01:06.160 |
we're talking about systems that are going to accomplish tasks 00:01:11.140 |
And most importantly, they're going to be essentially 00:01:16.140 |
with some set of instructions that then has access to tools 00:01:23.280 |
And then all of that is encapsulated into a runtime 00:01:27.900 |
And that's an important definition because today we launched 00:01:35.940 |
today we basically released the TypeScript equivalent. 00:01:40.700 |
And so we're going to use that, and it maps those exact patterns. 00:01:46.720 |
it's basically an SDK that provides you with an abstraction 00:01:51.700 |
based on the best practices that we learned at OpenAI to build agents. 00:01:57.220 |
And it comes with a couple of different base foundational features, 00:02:00.880 |
including things like handoffs, guardrails, streaming input and output, 00:02:04.620 |
tools, MCP support, built-in tracing so you can actually see what your agents did 00:02:11.940 |
And then additionally to those features that are coming from the Python SDK, 00:02:15.600 |
the SDK we launched today in TypeScript also includes human-in-the-loop support 00:02:20.060 |
with resumability so that if you need to wait for human approval for a while, 00:02:20.060 |
And most importantly, native voice agent support. 00:02:29.180 |
What that means in practice is you can use those same primitives 00:02:35.180 |
and you can build voice agents with them that handle handoffs, 00:02:37.840 |
have output guardrails to make sure that the agent is not saying things 00:02:41.440 |
it's not supposed to, tool calling, context management, 00:02:44.780 |
meaning keeping track of the conversation history 00:02:50.280 |
and built-in tracing support so that you can actually replay conversations, 00:02:55.200 |
listen to the audio of the user and properly debug what happened, 00:03:01.000 |
If you've tried to build interruptions, you might know how hard this is. 00:03:07.980 |
Both WebRTC and WebSocket support, meaning it can actually -- 00:03:11.780 |
communicate both on the server, for things like Twilio -- 00:03:16.940 |
like phone call voice agents -- or directly in the client in the browser. 00:03:22.520 |
That's what we're going to use today using WebRTC. 00:03:26.700 |
But first, why would we be interested in voice agents in the first place? 00:03:33.680 |
One of the things that I'm most excited about is it makes technology much more accessible to people. 00:03:38.420 |
There's something magical about being able to talk to a voice agent and just have it -- 00:03:50.100 |
I can convey information much faster, but also it can contain a lot of information 00:03:55.340 |
through the type of tone and voice that I'm using, the emotions. 00:04:00.160 |
So it's much more information dense than sort of just basic text is. 00:04:05.480 |
One of the cool things is also it can act as like an API to the real world. 00:04:09.020 |
You can have a voice agent go and like call a business for you and like have a conversation 00:04:14.120 |
with them where maybe there isn't an API for that business. 00:04:20.760 |
And so when we talk about building voice agents, there's essentially two types of architectures 00:04:28.100 |
The first one is based on your traditional text-based agent and just sort of wrapping it into a chained 00:04:34.200 |
approach where we have a speech-to-text model that is taking the audio and then turning it into text 00:04:40.480 |
so that we can run our basic text-based agent on it. 00:04:43.780 |
And then we take that and we run that agent, take the text output, and run it through a text-to-speech model 00:04:56.500 |
One of the most common reasons to choose this approach is that it's much easier to get started with if you already have an existing text-based agent. 00:05:05.940 |
You can take that, wrap some audio around it, and you have something that you can interact with. 00:05:11.580 |
But the other aspect is that you have full access to any model. 00:05:16.380 |
Text is the main modality that any LLM has, and so you can use really any of the cutting-edge models. 00:05:23.860 |
It also gives you much more control and visibility of what the model did by being able to actually look into the exact text that went in 00:05:35.100 |
Turn detection is one of the big ones where you need to now take into consideration what did the user hear 00:05:42.740 |
by the time that they interrupted the voice agent, then translate that part back into text. 00:05:49.820 |
Make sure that your transcript is appropriately adapted so that the model doesn't think it told the user something that it didn't. 00:05:57.820 |
Chaining all of these models together adds latency on every possible level, and so that's another big challenge. 00:06:05.660 |
And then you're losing some of that audio context, right? 00:06:08.540 |
You're transcribing the audio, and if you've ever tried to convey a complicated topic over a text, 00:06:14.940 |
you know it's a bit harder than dealing with the same thing using your own voice. 00:06:20.380 |
So, an alternative to that chained approach is a speech-to-speech approach where we have a model that has been trained on audio, 00:06:29.980 |
and then takes that audio to directly interact on the conversation and make tool calls, meaning there's no transcribing in the process. 00:06:40.300 |
The model can just natively deal with that audio, and that translates into much lower latency because we're now skipping those speech-to-text, text-to-speech processes. 00:06:50.300 |
We can also now have much more contextual understanding of the audio, including things like tone and voice. 00:06:58.060 |
And all of that leads to a much more natural, fluid level of conversation. 00:07:04.860 |
One of the most common ones is reusing your existing capabilities. 00:07:08.460 |
Everything is built around text, so if you already have some of those existing capabilities or a very specialized agent for a certain task, 00:07:18.300 |
Also, dealing with complex states and complex decision-making. 00:07:22.140 |
It's a bit harder with these models, since they've been really focused on improving on the audio conversational tone, 00:07:29.020 |
less so on being very complex decision-makers. 00:07:33.740 |
But there is a solution that we can get around with this. 00:07:37.180 |
Again, taking inspiration from what we do with text-based agents, we can actually create a delegation approach using tools, 00:07:45.820 |
where we have a frontline agent that is talking continuously to the user, and then that one uses tool calls to interact with much smarter reasoning models like o4-mini or o3. 00:07:57.020 |
Actually, let me, at this point, give you a quick demo and see how the internet goes here. 00:08:04.860 |
So I have a real-time agent here that we built with the Agent SDK. 00:08:08.700 |
It's going to be very similar to what you're going to build later on. 00:08:12.620 |
But when I start talking to this, hello there. 00:08:19.260 |
How is the AI Engineer World's Fair going? 00:08:23.260 |
So we can now give it, like, give a task to, like, call tools that I gave it, like, "Hey, what's the weather today?" 00:08:31.900 |
"Let me check the weather for you. One moment, please. Transferring you to the weather expert now. 00:08:40.540 |
Actually, I can directly help you with the weather information. Could you please specify the location you're interested in?" 00:08:46.540 |
Oh, yeah. What's the weather in San Francisco? 00:08:50.540 |
So you can see here it's actually dealing with the interruption. 00:08:53.900 |
Enjoy the bright and pleasant day. Is there anything else I can assist you with? 00:09:01.980 |
You're welcome. If you need anything else, feel free to ask. Have a great day. 00:09:06.940 |
And so in a similar way, we can actually trigger the more complicated back-end agents as well. 00:09:11.900 |
So I have a tool for this to handle refunds that will call out to o4-mini and evaluate the refunds. 00:09:18.060 |
So hey there. I have one more thing. So I recently ordered this skateboard that I tried to use, 00:09:25.180 |
and it seems like I'm really bad at skateboarding. So I want to return it. It is slightly a scratch, though. 00:09:31.180 |
I'm here to assist, but it sounds like you need customer service for that. I recommend contacting 00:09:40.300 |
the company where you bought the skateboard. They can provide you with the return. 00:09:44.860 |
Looks like I didn't add the tool. Maybe I did. Oh, I didn't ask for a refund. 00:09:51.740 |
Let's try this once more. Hey there. I bought a skateboard recently that I tried, 00:09:59.260 |
and apparently I'm really bad at using it. So I wanted to return it. It is slightly scratched, 00:10:04.220 |
so can you give me a refund? Hello there. How is the AI? 00:10:13.340 |
The joys of the joys of internet. Hey, I recently ordered a skateboard from you, and it 00:10:22.540 |
failed. Like, I can't use it. I'm struggling to use it. It's slightly scratched. Can you give me a refund, please? 00:10:30.460 |
Hello there. How is the AI? I'm going to assess your request for a refund. 00:10:38.940 |
There we go. Let's get started. It is slightly struggling with this, like, 00:10:41.820 |
weird echo that we're having here. The skateboard arrived damaged, and you're eligible for a full refund. 00:10:48.380 |
We'll process that for you. All right. But you can see here that it was able to call that more advanced tool 00:10:56.300 |
and actually process that request. And one of the nice things is that, like, while time to first token 00:11:04.380 |
is often a really important thing, the longer a conversation goes, like, your model is always going 00:11:09.980 |
to be faster than the audio that has to be read out. And so this is, like, a really helpful thing where, 00:11:15.900 |
by the time that the model was able to say, like, hey, I'm going to check on this for you, it already had 00:11:22.540 |
completed that LLM call to the o4-mini model to get the response there. All right. 00:11:29.660 |
Let me -- oh, one more thing. Since we talked about traces, one of the nice things now is we can 00:11:37.340 |
actually go back here into our traces UI. And with this launch today, you'll be able to actually look for any 00:11:44.380 |
of your real-time API cases, look at all the audio that it dealt with and all the tool calls. So we can 00:11:50.140 |
actually see here that the tool call was triggered, what the input was, the output. We can listen to 00:11:56.620 |
some of the audio again to understand what happened. And then because both this and the back-end agent 00:12:05.580 |
use the Agents SDK, we can go into the other agent as well, which was the o4-mini one, which we can see 00:12:11.660 |
here. And we can see that it received the context of the full past conversation, the full transcript, 00:12:17.180 |
as well as additional information about the request, and then generated the response here. So this allows 00:12:24.380 |
us to then get a full, complete picture of, like, what happened both in the front-end and the back-end. 00:12:28.620 |
Let's jump back into the slides and cover a couple of more things before we get coding. 00:12:34.940 |
And that's about best practices. So I would group the best practices of, like, 00:12:41.500 |
building a voice agent into three main things to keep in mind. The first one is to start with a small 00:12:47.580 |
and clear goal. This is super important because measuring the performance of a text-based agent -- you 00:12:53.740 |
will hear a lot about evals at this conference -- is already hard enough. But with voice agents, 00:13:00.220 |
it's going to be even harder. So you want to make sure that you're very focused on, like, 00:13:03.580 |
what is the first problem you want to solve and keep it focused on that and give it a 00:13:08.540 |
like, limited number of tools so that you're fully centered on this. The Agents SDK makes this really 00:13:13.500 |
easy because you can then later on add additional tools to additional agents and deal with, like, 00:13:19.180 |
handoffs between them. But this way you can kind of really stay focused and make sure that one of your 00:13:25.500 |
use cases is great and then hand off other ones to human agents, for example. The second one is 00:13:32.540 |
what I elaborated on, which is building evals and guardrails very early on so that you can feel both 00:13:38.700 |
confident in what you're building but also confident in that it's actually working so that you can then 00:13:46.060 |
continue to iterate on it and know when it's time for you to, like, grow the complexity of your voice agent. 00:13:52.700 |
As of today, you can use the traces dashboard for that. But alternatively, some of our customers 00:13:59.980 |
have even built their own dashboards, like Lemonade, to really get an end-to-end idea of the customer 00:14:05.420 |
experience and then even replay some of these conversations with their agent as they're iterating on 00:14:10.620 |
it. The other thing that I'm personally super excited about with these models is both our speech-to-speech model 00:14:17.900 |
and our text-to-speech model are generative models, meaning you can prompt them the same way that you 00:14:22.780 |
can prompt an LLM around tone and voice, and you can give it emotions, roles, personality. We built this 00:14:31.580 |
little microsite called openai.fm. It's a really fun website to play around with where we have a lot of 00:14:37.340 |
examples of different personalities and how that style of prompt can then change what is being read out by our 00:14:44.940 |
text-to-speech model. And so that's a great way for you to not just limit the experience 00:14:52.780 |
of your model or, like, the personality of your model to the voice that you picked, but to also shape it with the prompt 00:15:00.300 |
and instructions that you're giving it. One second, there was a question there. Would you mind using the mic that is 00:15:05.020 |
right behind you just so that it's on the recording? Hello, sir. So my question regards the previous 00:15:14.380 |
slides on Lemonade. So you're displaying how they have this dashboard where they can show all of this. 00:15:21.500 |
Is this a dashboard that OpenAI provides and Lemonade just integrates as, like, an iframe or something? 00:15:27.900 |
No. So in this case, they built their own solution for it. Okay. And does OpenAI then provide all the 00:15:34.540 |
JSON or the data structure that we can just plug into the... So the way the real-time API under the hood 00:15:40.460 |
works is that you get all the audio data and you can do whatever you want with that, basically. You're 00:15:45.820 |
getting all the necessary audio events so you can use those data structures. So we're not storing them by 00:15:50.540 |
default. You can use the Traces dashboard. We don't have an API for it yet, but you can use the Traces 00:15:57.580 |
dashboard to get a basic look of that, but it's not iframeable. But you mentioned it's only audio data. 00:16:06.220 |
This shows not just audio, but also the transcription and all of that as well, right? So the Traces dashboard, 00:16:11.820 |
if we go back to it, does show all of the transcripts and stuff as well, as long as you have 00:16:20.860 |
transcription turned on, which I don't seem to have turned on for this particular one. But it should, 00:16:29.340 |
like, you can turn on transcription and you should be able to see the transcripts as well. 00:16:37.980 |
All right. Let's go back to this. The other part with it is, as I said, you can prompt both the 00:16:47.100 |
personality. You can also be very descriptive with the conversation flows. One of our colleagues found 00:16:53.340 |
that giving it conversation states in this JSON structure is a great way to help the model 00:17:00.060 |
think through sort of what processes and what steps it should go through, the same way that you would give a 00:17:05.340 |
human agent a script to operate on. If you're struggling to write those scripts, though, 00:17:11.260 |
we also have a custom GPT that you can use to access that. And I'll share all of those links and a copy of 00:17:16.380 |
the slide deck later on in the Slack channel. So if you're in that, you should be able to access those. 00:17:21.660 |
But with that, that's primarily what I wanted to talk through from a slides perspective. So from here on, 00:17:32.460 |
what I want to do is build with you a voice agent. We'll see how that goes with the internet. 00:17:37.500 |
Also, if you have headphones, now is a great time to bring them out. It's going to be really weird 00:17:42.700 |
when we're all going to talk to our own agent. But we're going to try this and see how that goes. 00:17:48.380 |
So if you came in later, please scan the QR code. Go to that GitHub repository and set that up. Follow the install 00:17:57.340 |
instructions. There's no code in it yet other than, like, a boilerplate Next.js app and an empty, 00:18:03.660 |
like, package.json that installs just the dependencies that we needed so that we are not all trying to run 00:18:10.780 |
npm install at the same time. But what I want to do is build a first agent. So if you want, you can just 00:18:20.540 |
straight up copy the code that is on here. But I'm going to actually go and type it along with you all so 00:18:29.420 |
that you get a feeling for what's happening. And we have a good idea of timing. So if you want to take a 00:18:36.300 |
picture now, just go ahead. Do that. And otherwise, I'm going to switch over to my code editor and we're 00:18:44.860 |
So if you're running into trouble, the Slack is a great way to post questions that are technical 00:18:52.380 |
questions. And Anoop, who's over there, is going to try to help you. Alternatively, raise your hand, 00:18:59.340 |
but it's a bit easier if you're just slacking the messages there and we can kind of multi-thread the 00:19:05.500 |
problem. All right. Let's go and build an agent. So if you clone the project, you should see an 00:19:14.460 |
index.ts file. Go and open that and you should be able to import the Agent class from the @openai/agents 00:19:25.500 |
package. That's what we're going to use to create the first agent. Yeah? Oh, yeah. Good call. 00:19:43.900 |
Is that better? Cool. That seems a bit -- seems worse on my side than yours. But I think as long as you 00:19:53.020 |
all can read that, I'll be fine. All right. So what I want you to do is go and import an agent. And we're 00:19:59.180 |
going to define our first agent. And as I mentioned, primarily, an agent has, like, a few centerpieces. 00:20:07.100 |
The first one being a set of instructions on what to do. So we can give it instructions. I'm going to say, 00:20:12.940 |
you're a helpful assistant. It's sort of the most boilerplate thing you can do. We do need 00:20:17.420 |
to also give it a name. And that's so that we can actually keep track of them in our traces dashboard. 00:20:22.780 |
I'm going to say my agent here. This can be anything that helps you identify it. 00:20:28.220 |
And then we need to actually execute this agent. So we can import a run function here. 00:20:36.220 |
And then we can await the run here. I'm going to run this agent with just hello, how are you? And then 00:20:47.900 |
log out the results. And with the results, we get a lot of different information. Because essentially, 00:20:54.540 |
when we run an agent, it's going to do a lot of different tasks, from executing all necessary tool 00:21:00.300 |
calls, if there are any, to validating output guardrails, etc. But one of the most common things 00:21:05.740 |
that you just want to log out is the final output. That's whatever the last agent in an execution said. 00:21:13.180 |
So in this case, it's going to be a set of text. And then you should be able to run npm start, 00:21:19.820 |
npm run start 01. And that should execute it. And then you should see something like this, 00:21:33.340 |
depending on what your model decides to generate. And by default, this is going to run GPT-4.1 00:21:39.500 |
as the model. But if you want to experiment with this, you can set the model property here. 00:21:47.500 |
o4-mini, for example, and then rerun the same thing. So this is the most basic agent that you can build. 00:21:56.620 |
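To make this step concrete, here is a minimal sketch of that first text agent, assuming the starter's index.ts and the @openai/agents package; the commented-out model override is just illustrative:

```ts
import { Agent, run } from '@openai/agents';

// The most basic agent: a name (used to identify it in the traces dashboard)
// and a set of instructions.
const agent = new Agent({
  name: 'My agent',
  instructions: 'You are a helpful assistant.',
  // model: 'o4-mini', // optional: override the default model to experiment
});

async function main() {
  // run() drives the whole agent loop (tool calls, guardrails, handoffs, ...)
  // and resolves once the run is finished.
  const result = await run(agent, 'Hello, how are you?');

  // finalOutput is whatever the last agent in the run produced.
  console.log(result.finalOutput);
}

main();
```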
But one of the things that really makes something an agent is if it can execute tools. 00:22:01.900 |
So we can import a tool here. There we go. And we can define a get weather tool. 00:22:10.540 |
One of the things here is you have to specify what arguments the model is going to receive. And one 00:22:22.140 |
of the ways that you can do this is through a library called Zod. If you've never heard of it, 00:22:26.380 |
it's essentially a way to define schemas. And what we'll do is we'll both 00:22:31.660 |
use that Zod schema to inform the model on what the parameters for this function call are. 00:22:37.020 |
But we're also going to use it to then validate the 00:22:39.900 |
actual arguments that the model tried to pass in and whether they fit that schema. So we get 00:22:45.340 |
full type safety here. If you're a TypeScript developer and you care about that. 00:22:49.420 |
So in this case, we have a get weather tool. And then we can give that tool 00:22:55.340 |
to the agent, and we can change this to 'What is the weather in Tokyo?', which is what Cursor wants to check. 00:23:04.620 |
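A sketch of that get weather tool; tool() plus a Zod schema is the pattern from the SDK, while the tool name, description, and the hard-coded "sunny" result are just illustrative:

```ts
import { Agent, run, tool } from '@openai/agents';
import { z } from 'zod';

// The Zod schema both tells the model what arguments the tool expects and
// validates the arguments the model actually produced before execute() runs.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => {
    // A real implementation would call a weather API; this is a stub.
    return `The weather in ${city} is sunny.`;
  },
});

const agent = new Agent({
  name: 'My agent',
  instructions: 'You are a helpful assistant.',
  tools: [getWeather],
});

const result = await run(agent, 'What is the weather in Tokyo?');
console.log(result.finalOutput);
```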
let me move this slightly. We can see it's going to take a bit longer now. And that's because it ran some 00:23:11.900 |
tools. And now it's telling me the weather in Tokyo is sunny. And if you're wondering, well, did it actually run a tool? 00:23:27.580 |
and look at the trace. We have a my agent here. 00:23:35.180 |
And then there we can see it ran: it tried to call the tool, executed the tool and got 'the weather in Tokyo is sunny' back, 00:23:45.420 |
and then took the response to generate the final response. 00:23:50.380 |
So the traces dashboard is a great way for you to see what actually happened behind the scenes. 00:23:54.460 |
How are we feeling? Can I get a quick temperature check? Are people able to follow along? I see Zeke is 00:24:02.780 |
giving a thumbs up there. So this is a text based agent. I wanted to show you this just to get a bit 00:24:10.460 |
familiar with the overall agents SDK so that we can jump into building voice agents. 00:24:18.060 |
The first thing we need to understand about a voice agent is the slight differences between 00:24:25.420 |
a voice agent and what we call a real-time agent. Essentially, a real-time agent is just a 00:24:31.260 |
specialized version of an agent configuration. There's just a few fields you can't pass in. 00:24:36.540 |
But they can be used in what's called a real-time session. Because with voice agents, 00:24:41.180 |
there's a lot more things to deal with than just executing tools in a loop. 00:24:45.900 |
One of the most important things is you need to deal with both the audio that's coming in, 00:24:49.500 |
process that, and then run the model with that, and then deal with the audio that's coming out. 00:24:57.340 |
But you also need to think about things like guardrails, handoffs, other lifecycle things. 00:25:02.540 |
And so the real-time session is really dealing with all of that. 00:25:06.140 |
So let me show you how that works. For this, what we're going to do is we're going to go in the same 00:25:13.020 |
project. There's a 02. And it has a page TSX in there. This is a Next.js app that really, 00:25:23.900 |
I just gutted to have the bare minimum in there. But this is a great way for us to just build both the 00:25:30.060 |
front-end and the back-end part of the voice experience. Because this voice agent that we're 00:25:37.420 |
going to build is going to run in the browser. In order to make sure we're not leaking your API 00:25:41.980 |
credentials, one of the important things is you need to use an ephemeral key. That is a key that is 00:25:47.820 |
short-lived and is going to be generated by your server and handed off to your client so that they can 00:25:53.100 |
use that to interact with the real-time API over a protocol called WebRTC. For that, you should see 00:26:01.100 |
a token.ts file in your repository that just calls out to the real-time API to generate a session and 00:26:09.180 |
then return a client secret, which is that ephemeral key that we can then use to authenticate with the 00:26:14.060 |
SDK. You do not have to do this if you're building a real-time agent that is running on your server. For 00:26:20.540 |
example, in the case of a Twilio app or something else where you can just directly use the 00:26:27.740 |
OpenAI API key. But if you're running anything in the browser, then you actually need to generate 00:26:34.780 |
this client key just so that you're not, you know, giving your API key to the world. 00:26:41.900 |
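Roughly what that token.ts server action could look like. This is a hedged sketch: the /v1/realtime/sessions endpoint and the client_secret.value field follow my understanding of the Realtime API's ephemeral key flow, and the model string is illustrative, so check it against the starter repo and the docs:

```ts
'use server';

// Generates a short-lived client secret on the server so the browser
// never sees the real OpenAI API key.
export async function getToken(): Promise<string> {
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview-2025-06-03', // illustrative model name
    }),
  });

  const session = await response.json();
  // The ephemeral key is under client_secret.value (assumed response shape).
  return session.client_secret.value;
}
```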
So with that in here, we can actually go and build our first real-time agent. So similar to previously, we're going to 00:26:48.460 |
import an agent class here. But in this case, it's going to be a real-time agent. 00:26:53.820 |
We're going to import it from the real-time package, 00:26:59.180 |
which is just a sub path in the same package. So you don't need to install a different package here. 00:27:05.260 |
But now we can define a real-time agent that works the same way. We have a name. We give it instructions. 00:27:13.980 |
Just sort of going with a default suggestion here. And now we actually need to connect that agent to a 00:27:20.220 |
real-time session. So I have a connect button here for running this example. Let me start it up here 00:27:26.540 |
with npm run start 02. That command should be in your readme as well. 00:27:33.740 |
It's going to start a development server. And we can go over here, reload this. And you can actually see 00:27:39.820 |
it just has like a little connect button that right now doesn't do anything. 00:27:46.460 |
So let's connect that up. I don't need this anymore. Let me just move this to the side. 00:27:53.180 |
So in this on connect function that gets triggered whenever we press the button, 00:28:00.860 |
we want to deal with that connected state. So what we're going to do here is we're first going to fetch 00:28:08.140 |
that token. And this code, what this basically does is it's going to import that server action, 00:28:14.540 |
which is the next JS concept that just makes sure that like this code is going to run on your backend. 00:28:19.980 |
If you're using a different framework, you should be able to just go and fetch this key from your 00:28:25.420 |
backend server. And then once we have that token, we can go and create a new real-time session. 00:28:33.580 |
So what we're doing here is we're going to give it the first agent that should start the conversation up. 00:28:39.020 |
I'm going to specify the latest model that we released today along with the agent's SDK. 00:28:45.420 |
If you've used the real-time API before, this model is an improvement, especially around tool 00:28:50.860 |
calling. It's much better on that front. We have a couple of different customer stories on our Twitter, 00:28:56.780 |
if you want to check that out. And then I'm going to give it -- not there. I don't know why my 00:29:03.420 |
cursor insists on that. The last step that we need to do is we need to connect to that session. 00:29:08.540 |
So this is where we're going to give it that API key so that we can connect to the real-time session 00:29:16.940 |
under the hood. Just so that it's easier for us to deal with all of this, I'm also going to close 00:29:22.940 |
the session. But I've got one thing here that is an oddity of React. We do not want to generate that 00:29:31.580 |
session on every re-render. So I'm going to create what is 00:29:38.060 |
called a ref here. Again, if you're new to React, this basically is just a variable that's going to 00:29:44.620 |
persist through re-renders. So we need to slightly change this here. We're going to assign that to 00:29:49.740 |
session.current so we can maintain that. And then that also allows us to say if there is a session.current 00:29:57.420 |
set, we want to actually close that connection when we press the disconnect button. That just 00:30:02.460 |
makes sure that we're disconnected from the audio again. So I'm going to leave that on the screen for 00:30:09.660 |
a second, and then we can test this out. But if you already typed this, go into your browser, refresh, 00:30:15.260 |
press connect, and you should be able to talk to your agent. 00:30:18.140 |
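Put together, the browser side of this step looks roughly like the sketch below. The RealtimeAgent, RealtimeSession, connect, and close calls are the SDK pieces described above; the getToken import path, the component shape, and the model string are assumptions that mirror the walkthrough:

```tsx
'use client';

import { useRef } from 'react';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';
import { getToken } from './server/token'; // hypothetical path to the server action

const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
});

export default function App() {
  // A ref so the session survives re-renders without being recreated.
  const session = useRef<RealtimeSession | null>(null);

  async function onConnect() {
    if (session.current) {
      // Already connected: close the session to disconnect the audio again.
      session.current.close();
      session.current = null;
      return;
    }
    const apiKey = await getToken(); // ephemeral client secret from the server
    session.current = new RealtimeSession(agent, {
      model: 'gpt-4o-realtime-preview-2025-06-03', // illustrative
    });
    // In the browser this sets up WebRTC plus microphone and speakers automatically.
    await session.current.connect({ apiKey });
  }

  return <button onClick={onConnect}>Connect</button>;
}
```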
All right, let's try mine. Let me move this to the other side so it's not blocking your code. 00:30:37.900 |
Hello? Hi there. How can I assist these days? All right. So you can see it. It's just a few lines 00:30:46.940 |
of code. We didn't have to deal with things like figuring out how to set up the microphone, how to 00:30:51.420 |
set up the speakers. By default, if it's running in the browser, it will deal with all of that 00:30:56.380 |
automatically. If you do want to pass in your own microphone source or other things like that, 00:31:00.060 |
you can do that as well. If this is running on a server, you have both a 00:31:08.380 |
send audio function that allows you to send an audio buffer in, or you can listen to the audio 00:31:18.380 |
event, which is going to emit all of the audio buffers that are coming back from the model so that 00:31:25.420 |
you can pass it to whatever your source is. So that's our first basic agent. Any questions so far? 00:31:35.020 |
Please update the repo. Can you send that code to the repo? 00:31:42.780 |
You want me to send the code to the repo? Can you push it? Can you push it? 00:32:11.180 |
let's go and actually give it a tool. So this is really where the benefit of the 00:32:18.220 |
agents SDK comes in. We can actually use that same tool definition that we did earlier. So I'm just 00:32:23.980 |
going to follow the autocomplete here. We should be able to just give that tool now to our agent 00:32:31.820 |
and save. I need to import Zod again to do that schema validation. This is especially important on the 00:32:41.660 |
real-time side because the real-time model currently does not support strict mode. So the JSON 00:32:48.060 |
might not fully comply with your schema unless you're giving us a Zod schema and we'll go 00:32:57.740 |
and validate that this actually fits that schema. So that makes your code a bit easier. 00:33:03.020 |
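A short sketch of reusing that same tool definition on the real-time agent (names and the tool body as before, purely illustrative):

```ts
import { tool } from '@openai/agents';
import { RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

// Same shape as the text agent's tool. The Zod schema matters even more here:
// the real-time model doesn't support strict mode, so the SDK validates the
// arguments against the schema before execute() ever runs.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
  tools: [getWeather],
});
```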
So with that we can go back. Hey, what's the weather in San Francisco? 00:33:14.700 |
We can disconnect it here. Also, this does now deal with interruption. So, 00:33:21.900 |
what's the weather in San Francisco is the weather in San Francisco. 00:33:29.580 |
So the weather in San Francisco is sunny today. 00:33:31.660 |
Normally, that's enough to deal with the context. 00:33:52.780 |
But it is super crucial to have that interruption timing. 00:33:56.140 |
So that like your model doesn't think it read out like the full customer policy. 00:34:00.300 |
But the customer interrupted it halfway through, for example. 00:34:05.660 |
Question. You don't have to manage all the events to actually do that anymore? 00:34:12.060 |
No. So the real-time session will handle all of those events. 00:34:16.940 |
What we can do is listen to the transport event. 00:34:34.860 |
This will log out all of the events that are happening under the hood. 00:34:40.460 |
So if we open the dev tools here and rerun this. 00:34:52.220 |
So you can see all of the events that normally you would have to deal with are being dealt with. 00:34:59.980 |
So you can both read them, but you can also send events yourself. 00:35:05.580 |
But we continue to pass them on to you if you want to do your own logic on top of that. 00:35:10.700 |
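A sketch of tapping into those raw events; 'transport_event' is my understanding of the event name the session emits, so treat it as an assumption:

```ts
// Log every low-level Realtime API event the session is handling under the hood.
// The SDK still manages interruptions, audio, and tool calls; this is just a tap
// for your own debugging or custom logic.
session.current?.on('transport_event', (event) => {
  console.log(event.type, event);
});
```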
I'm going to push that code for you so you can pull it. 00:35:32.220 |
Since we already have this commented out code, the other part of this that typically is a request 00:35:37.340 |
that you want to deal with is I want to show like the transcript. 00:35:41.420 |
I want to see what sort of is being transcribed. 00:35:44.140 |
And the important thing here is I'm using the word transcribe because even though the speech-to-speech model 00:35:49.180 |
is dealing with the audio directly and there is no transcription step in between, by default, 00:35:55.740 |
we're going to transcribe all of the conversation at the same time. 00:36:02.620 |
If you're using the API directly, you have to actually turn it on. 00:36:06.380 |
In the agent SDK, it's turned on by default because it's such a common request. 00:36:11.420 |
And it enables us to do a couple of additional features that we'll cover later on. 00:36:16.620 |
But this is going to give us that whole history every time. 00:36:24.060 |
Or rather, I'm going to - there we go - import that. 00:36:34.140 |
And then because it's React, we can create a list here. 00:36:42.540 |
I need to filter because it has both tool calls and messages. 00:36:46.620 |
And I only want to show the messages for this. 00:36:48.380 |
So I should be able to - why does it want that? 00:37:14.380 |
So you're automatically getting that conversation. 00:37:17.420 |
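A sketch of rendering that transcript in React, assuming a 'history_updated' event whose items carry a type field that distinguishes messages from tool calls; the exact item shape (role, content) is an assumption to verify against the docs:

```tsx
'use client';

import { useEffect, useState } from 'react';
import type { RealtimeSession } from '@openai/agents/realtime';

export function Transcript({ session }: { session: RealtimeSession }) {
  const [history, setHistory] = useState<any[]>([]);

  useEffect(() => {
    // history_updated fires with the full conversation so far on every change.
    const handler = (items: any[]) => setHistory(items);
    session.on('history_updated', handler);
    return () => session.off('history_updated', handler); // off() assumed to exist
  }, [session]);

  // The history contains both tool calls and messages; only render the messages.
  const messages = history.filter((item) => item.type === 'message');

  return (
    <ul>
      {messages.map((item, i) => (
        <li key={i}>
          {item.role}: {JSON.stringify(item.content)}
        </li>
      ))}
    </ul>
  );
}
```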
If you are interrupting the model, one of the things that happens is the transcript is going to 00:37:22.060 |
disappear, and that's because the model currently does not adjust that transcript. 00:37:31.420 |
And we're going to remove it from that object as well, just so that you get the most accurate 00:37:36.140 |
representation and you're not thinking that, like, the model read out a certain piece of text. 00:37:40.940 |
And again, with everything that we're doing here, we can actually go back into traces. 00:37:46.780 |
And we can see that same representation here with the weather call and everything. 00:38:02.620 |
The question was, how do you store the conversation history? 00:38:13.340 |
So basically, there is going to be a bunch of events that I logged out that are being 00:38:20.460 |
All of those are going to be sent over to the client and then stored in memory in a conversation, 00:38:28.940 |
So you can do whatever you want with that by listening to that history updated event. 00:38:33.340 |
So if you do want to store it somewhere, you can store it. 00:38:36.220 |
The other part is the traces part is automatically going to be stored on the OpenAI platform. 00:38:44.300 |
As long as you have tracing enabled. You can disable it, but by default in the Agents SDK it's enabled. 00:38:52.460 |
And then the other aspect of that is if you are a ZDR customer, so a zero-data-retention customer 00:38:58.300 |
of OpenAI, you don't have access to that traces feature. 00:39:01.660 |
The question was, how much of the, like, voice context, how much of the previous conversation 00:39:24.780 |
That's going to depend, and it's sort of, like, dealt with directly by the real-time API. 00:39:29.340 |
So, like, the real-time API, when you start that session, 00:39:32.860 |
that holds the source of truth for that whole conversation session. 00:39:38.540 |
So what you're receiving on the client side is just a copy of whatever is happening at that point. 00:39:45.180 |
It's not the source of truth of what we're going to adapt to pass into the model. 00:39:49.820 |
The question is, how does it work with, like, the inference cost and, like, whether you're 00:40:16.780 |
passing it, like, passing in that whole conversation? 00:40:23.340 |
But yes, we're actually, like, you can log the, like, we're keeping track of the usage. 00:40:30.380 |
There's an event that you can, like, log out to see your token cost. 00:40:33.260 |
So you have an idea of, like, what is being actually passed in. 00:40:37.100 |
So, like, with every, if we're going back here to this example, you can see these response done events. 00:40:45.820 |
I don't know, where is the, shouldn't it be on the response done? 00:40:55.340 |
I just do not know right now why it's not showing. 00:41:00.220 |
So you can see here, it outputs the detailed information of your token usage at any point in time. 00:41:08.380 |
So while you don't have, like, access to, like, what is exactly what was passed into the next 00:41:13.180 |
response generation, you can keep track of the cost as it's happening. 00:41:23.980 |
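One hedged way to watch that usage as it happens: listen for the Realtime API's response.done transport event and read its usage object. The field names below are my best recollection of that event's shape, not guaranteed:

```ts
session.current?.on('transport_event', (event) => {
  if (event.type === 'response.done') {
    // usage breaks token counts down further, e.g. audio vs. text tokens (assumed fields).
    console.log('usage so far:', event.response?.usage);
  }
});
```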
That might be easier than you yelling across the stage to me. 00:41:27.260 |
I see that the format that you're using is PCM 16. 00:41:31.420 |
Is there a way in which we can modify the output formats of the audio files so we can save in memory? 00:41:36.700 |
Um, yeah, there are different, different audio modes that you can use. 00:41:41.260 |
Um, including, like, for example, ULaw for, that is, like, helpful for phone calls, for example. 00:41:52.140 |
Does that, like, final assistant response roll up all the tokens from, like, all the 00:42:02.220 |
Like, the agent needs to, like, kind of reason through and then format tool calls. 00:42:06.300 |
So I'm assuming it's not just the output tokens for only the assistant response, right? 00:42:10.860 |
It, like, every tool call is a response in general as well. 00:42:17.100 |
So, like, it works the same way that, like, the responses API works, for example. 00:42:24.620 |
Because we're using this and we have, like, tool calls and tool call outputs, right? 00:42:28.300 |
And I couldn't find the, like, usage attribute on the tool call output. 00:42:32.140 |
Is it somewhere in those, like, raw events that are outputted? 00:42:42.780 |
Yeah, do you want to head over to that microphone that is right behind you? 00:42:53.100 |
Can I go back to the slides explaining the different modes of the audio agents, like, 00:43:16.780 |
When you just showed us the GPT-4o real-time, that one... 00:43:28.380 |
When we did the refund, it kind of followed this pattern, where it performs a tool call... 00:43:34.780 |
Like, the, like, real-time API agent can perform tool calls. 00:43:39.500 |
It performed a tool call to trigger a separate agent that was the refund agent that, in my case, 00:43:45.180 |
used o4-mini to execute that task and then hand that back. 00:43:53.580 |
I'm currently using, like, a regular OpenAI agent. 00:43:59.340 |
So, what will be the challenge that we face when we want to change my regular agents to real-time agents? 00:44:07.740 |
So, there's a couple of different challenges. 00:44:10.220 |
Like, one is, like, anything that you're doing around latency... 00:44:16.140 |
Like, anything you're doing around voice, latency is always king. 00:44:21.580 |
So, like, you want to figure out what are the best ways to... 00:44:27.500 |
Like, when it comes to things like tool calling, you want to find ways to do things like buying yourself some time. 00:44:35.420 |
So, you will typically see some prompting around, like, announce what you're about to do next before you're doing it. 00:44:42.700 |
And that's to do that little trick around while the previous audio is still being read out. 00:44:49.180 |
The agent can already perform the tool call and wait for the tool call to come back. 00:44:54.540 |
Because, similar to a text-based agent, the model can't do... 00:44:58.620 |
Like, can't receive additional data as, like... 00:45:01.180 |
Like, do another thing outside of, like, we can interrupt the response, but it can't finish that response, if that makes sense. 00:45:09.260 |
And so, you want to do these sort of, like, buying time. 00:45:12.700 |
The other thing is, like, if you're building a real-time agent, the longer your prompt gets, 00:45:18.780 |
at one point it increases the likelihood that it gets confused. 00:45:21.900 |
So, you want to make sure you're properly scoping those use cases and, like, through what we call handoffs, 00:45:29.500 |
where you have different agents that are more scoped to specific steps in your experience. 00:45:40.460 |
Can you speak a little bit more about memory? 00:46:05.340 |
When we go back to this demo, what you're seeing here is essentially just, like, 00:46:11.100 |
a copy of the events that we're receiving back. 00:46:14.300 |
So, this is, like, helpful as a visualization of the history. 00:46:17.180 |
That being said, the actual, like, memory in the sense of, like, an LLM agent memory is the 00:46:24.940 |
session context that is happening on the real-time API side. 00:46:28.620 |
There are events that you can use to update that. 00:46:31.100 |
We actually have an update history event that you can pass in what you want the history to be. 00:46:37.020 |
But what that does is essentially, like, fire off events to the real-time API to say, like, 00:46:42.620 |
delete this item from the history or add this new item. 00:46:48.700 |
So, like, you can, for example, like, slot messages into a specific spot if you wanted to. 00:46:56.060 |
But there's, like, no, like, advanced, like, long-term memory solution like you were alluding to. 00:47:06.780 |
Do you have tips for handling input from low-fluency users? 00:47:10.700 |
Like, say someone who's just learning a language and they have, like, multilingual input and maybe 00:47:15.740 |
broken grammar and their pronunciation is not so good? 00:47:17.980 |
I don't think I have any, like, best practices right now that I could share. 00:47:26.220 |
It can handle, like, switching languages and things like that. 00:47:31.100 |
But it might not be able to handle low fluency. 00:47:37.660 |
Yeah, we have some customers that are, like, language learning companies. 00:47:47.660 |
So, there is some that are using it that way. 00:47:49.980 |
But I don't think I have any, like, best practices that I can share. 00:47:58.540 |
Back in the code, is there a callback for the interrupt? 00:48:05.340 |
Um, there is a callback for the interrupt, but, uh, there is no, um, 00:48:18.140 |
There's no param or e that comes with it or anything like that? 00:48:23.500 |
So, what you can do is, if you're getting this, you can call -- 00:48:34.460 |
Um, so you can -- you have access in that moment. 00:48:37.420 |
The thing that we do have is, um, for tool calls specifically, 00:48:48.300 |
and that context has a, um, history parameter 00:48:55.500 |
Um, it's more documented in the -- in the documentation. 00:49:11.820 |
is you can reuse the same syntax that you're doing with text-based ones. 00:49:15.340 |
Um, it's also a good way for you to then communicate with your 00:49:22.060 |
Um, follow sort of a, um, general practice around, like, 00:49:30.060 |
keeping both the tool calls as low latency as possible. 00:49:36.700 |
Like, for example, if you know a task is going to take longer, 00:49:39.980 |
start the task, give it a task ID, and have the agent have a tool to check on the status, 00:49:47.020 |
Like, that helps getting back to it because, again, while the tool call is going on, 00:49:54.940 |
So, you want to -- you want to make sure to, like, get back to that as soon as possible. 00:49:58.940 |
Um, one of the other things that you can do is human approval -- uh, human approval. 00:50:03.660 |
I can show you that quickly. There's essentially a, uh, needs approval option 00:50:10.300 |
that, um, you can either specify as a function that will be evaluated before the tool ever gets 00:50:17.820 |
triggered. This is a great way if you have, like, a more complex logic on "I need approval for this." 00:50:23.820 |
You can also give it just straight up, "I always need approval," at which point there is a 00:50:30.300 |
another event here, um, tool approval requested, and then that gets a, um, event here, so we can 00:50:43.020 |
do things like, um, good old prompt. Um, and then we can go and approve that tool call again. 00:51:02.220 |
I don't know why the autocomplete is not working. Um, um, proof. There we go. And, uh, 00:51:29.820 |
this is where I go into the docs because I do not remember why this is autocompleting the wrong way. 00:51:40.140 |
But everything I'm showing you is in the docs. Um, so we can just -- oh, took the wrong thing, right? The first -- 00:52:04.540 |
Approval request. Thank you. It's like the classic thing when you're on stage and you can't really -- 00:52:12.620 |
there we go. So, in this case, I'm just going to always approve. But if we now go in, 00:52:24.620 |
go in, "Hey, um, can you tell me the weather in Seattle?" 00:52:29.500 |
So we can, in that case, approve it. It's always going to approve right now because I'm not actually 00:52:38.620 |
checking the status. But, um, that means you can build, like, a human in the loop approval experience. 00:52:44.220 |
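A sketch of that approval flow. needsApproval on the tool is the SDK option being described; the tool_approval_requested event signature and the approve() call are written from memory and should be checked against the docs:

```ts
import { tool } from '@openai/agents';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  // Can also be an async function of the arguments if only some calls need approval.
  needsApproval: true,
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

// The session pauses the tool call and asks before anything executes.
session.current?.on('tool_approval_requested', (_context, _agent, request) => {
  // In the browser, a good old confirm() is enough for a quick demo.
  if (window.confirm('Approve this tool call?')) {
    session.current?.approve(request.approvalItem); // payload shape assumed
  }
});
```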
This is really convenient, especially if you're running it in the browser and you just want to have, 00:52:48.700 |
like, a confirmation of, like, the tool is hallucinating things before the customer 00:52:53.580 |
And does it do it directly? Can it actually say, "Are you okay if I do this?" 00:52:58.460 |
So basically, this is happening -- so the question is, does it automatically do this? Like, 00:53:08.700 |
the -- what we're doing and the reason why this is separate is the model is asking for this tool to be 00:53:15.900 |
executed. But we're intercepting this, basically, before we're ever generating or executing the response. 00:53:22.300 |
This is intentional so that, like, you don't have to deal with -- like, we want you to think through 00:53:29.340 |
why should this tool need approval as opposed to doing that somewhere halfway through your tool execution. 00:53:35.260 |
And you have to, like, deal with the consequence of rolling back every decision that you've made, 00:53:39.740 |
for example. And so, by default, if this is just set to true, it cannot get past that until the execution 00:53:47.340 |
was approved, at which point it stores it in the context that is stored locally and then bypasses 00:53:53.100 |
that security. So this is not happening on the model level. 00:54:01.660 |
So the other thing we talked about already, but I want to show it in practice, is handoffs. So a handoff 00:54:09.500 |
is essentially just a specialized tool call that resets the configuration of the agent in the session. 00:54:17.660 |
So that we can update the system instructions, we can update the tools, and make sure that we can 00:54:24.460 |
nicely scope the tasks of what we're trying to solve. So what you cannot do -- I know people are probably 00:54:33.260 |
going to ask about this -- is you can't change the voice of the agent mid-session. You could define 00:54:40.380 |
different voices on different agents, but the moment that you're, like, the first agent that starts talking, 00:54:46.460 |
that's the voice that we're going to stick with throughout the entire conversation. So that's a caveat 00:54:52.300 |
to just keep in mind. But they're still very helpful to, let's say, have a weather agent here. 00:55:05.900 |
We'll do this. And then what we can do is we can actually give it a handoff description. So if you 00:55:08.540 |
don't want to have this in your system prompt, but you just want to help the model understand when to use this, you can say, like, this agent is an 00:55:44.060 |
expert in weather. And then this one is going to have that weather tool. We're going to remove it from this one, 00:55:50.780 |
and we're going to give it a handoff instead to that other weather agent. 00:55:54.140 |
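A sketch of the handoff setup just described; handoffDescription and handoffs are the SDK fields, while the names, instructions, and the getWeather tool (defined as earlier) are illustrative:

```ts
import { RealtimeAgent } from '@openai/agents/realtime';

// A scoped agent that owns the weather tool (getWeather defined as earlier).
const weatherAgent = new RealtimeAgent({
  name: 'Weather agent',
  // Helps the main agent decide when to hand off, without bloating its own prompt.
  handoffDescription: 'This agent is an expert in weather.',
  instructions: 'You answer weather questions. Talk with a New York accent.',
  tools: [getWeather],
});

// The frontline agent no longer has the tool; it hands off instead.
const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
  handoffs: [weatherAgent],
});
```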
So now if I'm going to restart this. Hey, can you tell me the weather in New York? 00:56:02.460 |
The weather in New York is sunny, so you might want to grab your sunglasses if you're heading outside. 00:56:11.340 |
Enjoy the day. All right. That's the model's best attempt at a New York accent. 00:56:18.540 |
We'll take it. But you can see there that, like, it automatically handed off from that first agent to 00:56:24.060 |
that second one and let it handle it. You can, through prompting, do things like, do you want it to 00:56:29.580 |
announce that it's about to handoff? Do you not want to do that? Sometimes it's a bit awkward if you're 00:56:34.460 |
forcing it to always do it. So, like, I would not necessarily try it, but maybe that's the type of 00:56:39.740 |
experience that you want to have. So that's handoffs. Let me do you a favor and push that code. 00:56:47.020 |
Push, maybe push. So the agent can't change the voice when passing to another one, but it can 00:56:54.780 |
change accents? Yeah, so that's a good question. The agent can't change the voice, but it can change the 00:57:04.540 |
accent. Again, this goes back to, like, the model is a generative model. So you can prompt it to have 00:57:15.340 |
different, like, pronunciations, tonality, like, voice in that sense, but it cannot change the voice model 00:57:25.340 |
that is actually being used to generate that output. 00:57:30.220 |
So maybe as a extension of that, is it, like, the whole real-time request body that can't be changed 00:57:37.340 |
or just the voice? So can I, like, create a tool that could adjust the speed if someone was saying it's 00:57:43.180 |
talking too fast, or the noise reduction? You should be able to change it. I have not tried the 00:57:50.060 |
speed parameter changing mid-session because it literally came out today. I don't know. Anoop, 00:57:57.180 |
have you tried this? No. But, like, you can, like, essentially a handoff does change the session 00:58:03.500 |
configuration. Like, if we look back at, like, the, like, one of the transcripts here, like, now that we 00:58:13.020 |
have a handoff. Let's go to this trace. So you can see here that, like, it called the transfer to 00:58:21.260 |
weather agent. But then, like, these instructions were 'talk with a New York accent'. So in this case, 00:58:32.380 |
it did change the instructions midway through the session. And the same way, like, when that handoff 00:58:38.060 |
happens, we take away the tools, we give it new tools. So you can change those tools. You could 00:58:44.300 |
have a tool to change the tool, but my recommendation would be, like, use a different agent for that. 00:58:49.420 |
But then, like, the speed control, like, you could, you should be able to send off an event, but I have 00:58:55.180 |
not tried that. Yeah, or maybe, like, like, the background. Like, basically, like, if you had 00:59:00.380 |
something and someone was, like, in a noisy environment, like, hey, you seem to be getting 00:59:05.900 |
interrupted. Could you adjust what we catch in the background and start adjusting that parameter 00:59:11.180 |
so just the voice is protected as far as possible? Yeah. So the question is, like, for example, if someone 00:59:17.900 |
is, like, in a noisy environment, like, could you have the agent detect that and then use, 00:59:23.500 |
like, adjust some of the session configuration to deal with that and just the voice is protected? 00:59:30.060 |
I don't know, honestly, which parameters are protected or not. The good thing is, like, the API will throw 00:59:36.060 |
an error if the thing didn't work. So it's a good way. It's a good thing to experiment with. 00:59:41.100 |
You could do that in Python. Hmm? With the previous ones, you could do that in Python. 00:59:47.580 |
Oh, yeah. Well, with the Python Agents SDK, we're doing the chained approach. We don't have a real-time 00:59:55.340 |
built-in yet. So... Just calling the old API in Python, you could change it. Oh, all right. Yeah, 01:00:03.500 |
yeah. Then it should work. Yeah. If you can do it in Python, like, it should just work. 01:00:09.980 |
Cool. So the other thing we talked about is this delegation part. So that's what I, like, 01:00:19.740 |
had mentioned earlier that was in the diagram. So this is essentially where you want to be able to 01:00:26.460 |
have certain complex tasks dealt with by a more intelligent model. And the way we can do that is 01:00:33.580 |
essentially just creating another agent except on the back end. And because the TypeScript SDK works both in 01:00:42.220 |
the front and back end, we can do that through -- I think I have a -- let's see if we have a file here or not. 01:00:52.220 |
We can do that using the same SDK. So I'm going to create on the -- in the server folder here, a new 01:01:00.300 |
file I'm going to call just agent. And in here, we can build our regular text-based agent. So this is 01:01:09.660 |
essentially the same code that we've done before. And we can say -- this is a -- I don't know -- 01:01:20.460 |
called the Riddler. You are excellent at creating riddles based on a target demographic 01:01:36.380 |
and topic. And we'll just give it a model of o4-mini. 01:01:42.940 |
Also, a reminder for those, if you are trying to follow along and you run into troubles, post in 01:01:53.100 |
the Slack, and Anoop can help you with that. So we have that new agent here. We're not going to give it 01:01:59.740 |
any tools or anything. And then we can export a function here 01:02:05.180 |
that we just call a run agent. And this is just going to take some input and then return that output. 01:02:17.660 |
And we can go back into our front-end code, create a new tool here, create riddle. 01:02:27.100 |
And this one, we're just going to have -- take like two parameters, the demographic and the topic, 01:02:33.740 |
and then call out the run agent function that is going to run on the server. 01:02:42.140 |
We can pass in -- actually, realize I didn't specify this. Let's do demographic and topic. And then create 01:02:50.940 |
an input here of this. The other thing you want to do when you're using server actions in 01:03:01.420 |
Next.js is put that 'use server' at the top. That makes sure that this file executes on the server. 01:03:08.460 |
And then we can pass in that demographic. And again, if you're using a different framework, 01:03:14.140 |
this is just the equivalent of a fancy fetch request. So if you want to do an HTTP request to 01:03:19.740 |
your server, if you want to maintain a WebSocket connection to your own backend, you can do all 01:03:25.100 |
of those things to talk back to other systems. So with that, we can give that to our main agent. 01:03:37.900 |
And then what you want to do in these cases is -- you can tell it like announce when you are about to do 01:04:00.780 |
a task. Don't say you are calling a tool. Things like that can be helpful to like buy itself some time. 01:04:11.820 |
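A sketch of the whole delegation setup: a backend "Riddler" agent behind a Next.js server action, and a create riddle tool on the real-time agent that calls it. The file layout, names, and the "announce before you act" instruction mirror the walkthrough but are assumptions:

```ts
// server/agent.ts
'use server';

import { Agent, run } from '@openai/agents';

// A reasoning-heavy backend agent; the voice agent delegates to it via a tool call.
const riddler = new Agent({
  name: 'Riddler',
  instructions:
    'You are excellent at creating riddles based on a target demographic and topic.',
  model: 'o4-mini',
});

export async function runAgent(input: string): Promise<string> {
  const result = await run(riddler, input);
  return String(result.finalOutput ?? '');
}
```

```ts
// page.tsx (client side): the tool the real-time agent uses to delegate.
import { tool } from '@openai/agents';
import { z } from 'zod';
import { RealtimeAgent } from '@openai/agents/realtime';
import { runAgent } from './server/agent'; // hypothetical path

const createRiddle = tool({
  name: 'create_riddle',
  description: 'Create a riddle for a given demographic and topic.',
  parameters: z.object({ demographic: z.string(), topic: z.string() }),
  // Runs through the server action; from another framework this would just be a fetch.
  execute: async ({ demographic, topic }) =>
    runAgent(`Create a riddle for ${demographic} about the topic ${topic}.`),
});

const agent = new RealtimeAgent({
  name: 'Voice agent',
  // Buy time: announce what you're about to do before calling the tool.
  instructions:
    'You are a friendly assistant. Announce when you are about to work on a task; do not say you are calling a tool.',
  tools: [createRiddle],
});
```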
Hey there. Can you tell me a riddle for like a five-year-old Star Wars fan? 01:04:26.220 |
Hey are you there? I'm still here. I'm working on creating that Star Wars riddle for you. It should 01:04:34.220 |
be ready in just a moment. Here's a riddle for your little Star Wars fan. I'm not -- So you can see 01:04:41.340 |
that like because it announced that like what it's about to do, the tool call came back before it even 01:04:49.260 |
finished what it previously said. And so like that's again one of the benefits of like if you can get your 01:04:54.940 |
agent to balance out that and like buy itself some time, this is a good way to deal with the more 01:05:03.420 |
complex tasks. And like it also means that you can like for example take all of the like more reasoning 01:05:10.700 |
heavy workloads and take it out of the voice agent model. For delegation, is it possible to delegate to 01:05:20.220 |
more than one agent like simultaneously or is it just one in the current SDK? 01:05:24.380 |
You can -- it's done via tool calls. So I think you would have two options, right? You could do 01:05:31.500 |
parallel tool calling, or you could have one tool that then triggers running multiple agents, right? So 01:05:42.700 |
my recommendation would potentially be that second option, so that you're not relying on the model making 01:05:51.180 |
multiple tool calls at the same time. You want to make the decision-making for the voice agent always as easy as possible. 01:05:57.100 |
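As a sketch of that second option -- one tool fanning out to several backend agents at once -- something like this could work; the two server-side run functions here are hypothetical stand-ins:

```typescript
import { z } from 'zod';
import { tool } from '@openai/agents/realtime';
// Hypothetical server actions, each running its own backend agent.
import { runResearchAgent, runSummaryAgent } from '@/server/agents';

const delegateWork = tool({
  name: 'delegate_work',
  description: 'Research and summarize a topic using backend agents.',
  parameters: z.object({ topic: z.string() }),
  execute: async ({ topic }) => {
    // One easy decision for the voice model; the fan-out happens here.
    const [research, summary] = await Promise.all([
      runResearchAgent(topic),
      runSummaryAgent(topic),
    ]);
    return JSON.stringify({ research, summary });
  },
});
```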
All right, thanks. Yep. Previously, when you did -- can you go back to the previous page? The -- 01:06:04.940 |
Which example? The one you're running where you had the output. Oh, here. 01:06:11.260 |
So on line three there where it said I'm still here. Yeah. Is that coming from your SDK? Do I have to use the SDK to do that? 01:06:19.580 |
It's the real-time API. No, it's the real-time API. It responded because I asked like, hey, are you there? 01:06:24.700 |
So it realized that it hadn't started anything yet and that I'd sort of interrupted, and it was like, hey. 01:06:35.100 |
Yeah. So the thing is, I didn't render out the tool calls in here, right? So what basically happened between this 01:06:49.180 |
and this was: it started off a tool call, and then, 01:06:57.100 |
because I interrupted it, it stopped that tool call. It stopped the generation. It also reset the transcript here. 01:07:06.140 |
That's a good indicator that the interruption happened. And so when I said this, it remembered it was trying to call a tool and did that tool call. 01:07:17.340 |
So that's all just the regular Realtime API. 01:07:30.700 |
What's the cost per minute? We charge per token. There's some 01:07:33.900 |
translations. I don't know -- Anoop, do you have the -- 01:07:40.060 |
All right. So it's more expensive than TTS and 01:07:53.500 |
speech-to-text chained up with a model, in most cases, but it depends on the use case 01:08:02.620 |
and sort of the actual model choices and stuff. 01:08:04.700 |
So it's a bit harder to say what the per-minute pricing is, because again, it's by tokens, 01:08:10.540 |
and it also depends on whether you have transcription turned on and how many function calls you have and things like that. 01:08:17.340 |
Because it's a mix between audio and text tokens 01:08:22.380 |
So one of the interesting things -- and this is not a thing in the regular API, 01:08:28.860 |
this is an Agents SDK-specific thing -- 01:08:31.660 |
is guardrails. So the Agents SDK, both in Python and TypeScript, 01:08:36.860 |
has this concept of guardrails that can either protect your input or your output, 01:08:41.260 |
to make sure that the agent is not being meddled with or doing things that are against policy. 01:08:46.540 |
We took that same pattern and moved it over to the realtime side, 01:08:51.580 |
where essentially we're running these guardrails that you can define 01:08:55.900 |
in parallel, on top of the transcription, at all times. You can specify -- you can see it at the bottom here -- 01:09:03.740 |
how often you want to run them, or if you only want to run them when the full transcript is available. 01:09:08.060 |
But this is a great way for you to make sure that the 01:09:11.580 |
model doesn't violate certain policies. You want to make sure that these run 01:09:16.460 |
as efficiently as possible because they're still running in the client. 01:09:20.620 |
But this is a good way to still enforce or stick to certain policies, and if it violates those, 01:09:26.540 |
it will interrupt it. Now, there is a bit of a caveat: because we're running this on a transcript, 01:09:37.900 |
there's a bit of a timing aspect, where if it would violate a policy, 01:09:46.620 |
chances are it will say those first couple of words, 01:10:00.780 |
because the text output is not really a transcript -- the text output will be done 01:10:06.060 |
before the audio is done speaking, before it's done saying everything. 01:10:16.860 |
So to give you an example: this is a guardrail that just checks, is the word 'Dom' in there? 01:10:25.660 |
In this case, if I would ask it, hey, please call me Dom, 01:10:29.820 |
chances are it will call me Dom and then self-correct. 01:10:32.860 |
If I tell it to tell me a story and only introduce Dom in the second act, 01:10:43.740 |
it would catch that at a much earlier point, because that transcript is going to be 01:10:48.700 |
done before the audio is complete, before the audio is read out to the user. 01:10:56.620 |
So instead the model is going to be like, okay, I'm sorry, I couldn't help you with that. 01:11:00.540 |
Let's do something else instead. And you can give it 01:11:11.100 |
this output info, where you can inform the model what it should do instead. 01:11:22.540 |
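Based on my reading of the TypeScript SDK docs, a realtime output guardrail like the 'Dom' check above might look roughly like this -- treat the exact option names as assumptions to verify against the SDK reference:

```typescript
import { RealtimeSession } from '@openai/agents/realtime';

const session = new RealtimeSession(voiceAgent, {
  outputGuardrails: [
    {
      name: 'No mention of Dom',
      // Runs against the (partial) output transcript as it streams in.
      async execute({ agentOutput }) {
        const mentionsDom = agentOutput.includes('Dom');
        return {
          tripwireTriggered: mentionsDom,
          // outputInfo tells the model what to do instead after being interrupted.
          outputInfo: { instruction: 'Do not call the user Dom. Apologize and continue.' },
        };
      },
    },
  ],
  // How much new transcript text accumulates before the guardrails run again.
  outputGuardrailSettings: { debounceTextLength: 100 },
});
```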
No, um, you can choose. So the question was, is the transcript still done with Whisper? You can switch the transcription models. 01:11:33.420 |
There are two models: one is gpt-4o-mini-transcribe and -- 01:11:38.380 |
there are two, right? Yeah -- and gpt-4o-transcribe. 01:11:45.260 |
I was trying to remember -- we have only one text-to-speech model, but we have two transcribe models. 01:11:56.460 |
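If you want to switch the transcription model, the session config should be the place to do it. This sketch assumes the SDK mirrors the Realtime API's input audio transcription setting, so check the reference for the exact key names:

```typescript
const session = new RealtimeSession(voiceAgent, {
  config: {
    inputAudioTranscription: {
      model: 'gpt-4o-mini-transcribe', // or 'gpt-4o-transcribe'
    },
  },
});
```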
This is the main part of what I wanted to walk you all through. So 01:12:04.460 |
I'll post all of the links and the slides in the Slack channel -- which, let me go back to the 01:12:22.300 |
resources -- so you can check them out afterwards. 01:12:24.780 |
I also already put a bunch of the resources that I talked about 01:12:29.740 |
into the bottom of that starter repository, so you should have access there as well. 01:12:34.540 |
And I'm happy to hang around and answer any questions. 01:12:45.020 |
[Audience question, partly inaudible, about how prompt caching works and whether requests get routed to the same machine.] 01:12:56.620 |
Yeah, the question was around how prompt caching works, and sort of whether, for prompt caching, we guide 01:13:01.660 |
the requests to the same system again to run, and whether there's any control 01:13:19.660 |
over that with realtime, because latency obviously matters. I don't think there are any controls for that. 01:13:40.140 |
No -- I'm getting a no from over there -- so I don't think there are any controls around that right now. 01:13:50.380 |
Hey, so we all know that having natural conversations involves more than just spoken words. It involves 01:13:59.180 |
detecting emotion and adjusting tone. It also involves cadence, and 01:14:06.380 |
Even humming to let the other person know that you are listening 01:14:11.980 |
I wonder if the current speech to speech model is capable of having that kind of natural conversation 01:14:18.380 |
Part of this is a prompting challenge, so it definitely can have pretty natural-sounding conversations. 01:14:30.860 |
This is the part where I highly recommend checking out the openai.fm page, 01:14:36.620 |
because it's pretty interesting to see -- if we go to 01:14:54.140 |
the playground, which I normally call out if you're just getting started with realtime and don't even want to write a line of code: 01:15:03.020 |
this is a great way to just have conversations and try things out. 01:15:08.940 |
It has a couple of system prompts; one of my favorite ones to show is this bored teenager one. 01:15:19.260 |
Hey there, um, so I'm at AI Engineer World's Fair and everyone is super stoked about voice agents. 01:15:28.460 |
Can you show me some excitement of this whole thing launching today? 01:15:47.900 |
Not really my thing to get all hyped up about it 01:15:50.940 |
So you can see in this case it put its own pauses in there and stuff -- this wasn't a pause because the model was waiting, right? 01:16:01.180 |
It can deal with a lot of that sort of adjusting tone and voice, and it can do similar things, 01:16:07.740 |
like reacting to someone talking and stuff. 01:16:16.860 |
The API that's been released -- what else has changed? So you mentioned it's improved; is there anything else that's improved in terms of tone detection or voice? 01:16:27.180 |
We primarily released a new base model -- or, a new model of the 01:16:34.300 |
gpt-4o realtime model, which is just better at function calling and has been -- 01:16:46.380 |
Can you inject audio as background audio -- like ambient audio, like typing audio, and so on? 01:17:00.140 |
Right, yeah, you can just intercept the audio that is coming back from the model and then overlay your own audio. 01:17:09.020 |
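One way to do that overlay on the client is plain Web Audio mixing; how you get hold of the model's output stream depends on your transport (for WebRTC it's the remote audio track), so modelStream here is an assumption:

```typescript
const audioCtx = new AudioContext();

function playWithAmbience(modelStream: MediaStream, ambienceUrl: string) {
  // Route the model's audio straight to the speakers.
  audioCtx.createMediaStreamSource(modelStream).connect(audioCtx.destination);

  // Loop a quiet ambience track (typing, room tone, ...) underneath it.
  const ambience = new Audio(ambienceUrl);
  ambience.loop = true;
  const gain = audioCtx.createGain();
  gain.gain.value = 0.2; // keep the background well below the voice
  audioCtx.createMediaElementSource(ambience).connect(gain).connect(audioCtx.destination);
  void ambience.play();
}
```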
Yeah, hi, um, I was wondering what your support is for like multiple speakers 01:17:15.100 |
if there's more than one person in a conversation, can it detect who's talking and do pauses that way? 01:17:20.220 |
There's no current speaker detection in the model. 01:17:32.860 |
I just wanted to ask about custom voices: is it limited to the preset voices you have, or can I upload my voice as a sample, for instance? 01:17:41.420 |
It's currently limited to the voices that we have. We keep adding new voices, though. 01:17:46.940 |
Is there going to be support to add custom voices anytime in the future? 01:17:55.500 |
What I can say is we're trying to make sure we're finding the safest approach. 01:17:55.500 |
We have an article online that talks about the responsible approach we're trying to take on this, 01:18:02.380 |
on making sure that if we're providing custom voices, they come with the right guardrails in place and stuff 01:18:08.860 |
to avoid abuse. All right, thank you. You're welcome. Yes? 01:18:17.100 |
I don't think that has changed um to my knowledge 01:18:32.860 |
My personal recommendation would be -- one of the things that you can do, 01:18:37.980 |
and this goes back to, for example, 01:18:44.380 |
if you're keeping track of the transcript and stuff: 01:18:47.660 |
when you're starting a new session, 01:18:52.060 |
you can populate that context by creating new items using the API. 01:18:57.820 |
So one of the things that you could do is, when starting a new session, if you know what the previous context was because you kept track of it, 01:19:03.980 |
you can then basically inject that as additional context. 01:19:11.740 |
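The underlying mechanism is the Realtime API's conversation.item.create client event; how you send it (SDK transport, WebRTC data channel, or WebSocket) depends on your setup, so sendClientEvent below is a hypothetical helper:

```typescript
// Seed a fresh session with the transcript you kept from a previous conversation.
function injectPreviousContext(
  sendClientEvent: (event: object) => void,
  previousTranscript: string,
) {
  sendClientEvent({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [
        { type: 'input_text', text: `Context from our previous conversation:\n${previousTranscript}` },
      ],
    },
  });
}
```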
What type of event are you looking for? Oh, whether there's a timeout event? 01:19:20.060 |
All of our events are documented in the API reference. 01:19:26.220 |
Yeah, so when you say the Realtime API can do tools and function calls, 01:19:31.340 |
does that really include things like system file reading and writing in those capabilities? 01:19:38.780 |
Sorry, can you repeat that question one more time? 01:19:41.020 |
Oh, I was wondering if the Realtime API can use function calls -- 01:19:45.020 |
functions such as system file writing or reading. 01:19:49.180 |
Oh, whether the function calls can do things like system file reading and stuff. 01:19:53.820 |
I would say it depends on where you're running that -- where you're running that 01:19:57.900 |
realtime session. So if you're running it on the server, you can do anything you can do on the server. 01:20:04.780 |
If you're running it in the browser, then you're limited to whatever things are available in the browser. 01:20:13.740 |
No, you should be able to -- you could create, like, a WebSocket-based one 01:20:22.220 |
on your device. I mean, it's going to use the Realtime API for the model, 01:20:27.100 |
but then, because the actual tool calls get executed on your system, 01:20:31.500 |
You should have access to whatever aspect of your system your program has access to 01:20:39.100 |
Cool. Yes? So even before we get to voice agents, we all need bigger, better, and more diverse evaluation sets, 01:20:47.180 |
Especially for anything around function calling and parameterization 01:20:50.540 |
Do you have any best practices or suggestions for how we now take evaluation into the voice world? 01:20:56.620 |
Should we keep things in text and then, you know, use text-to-speech to have voice versions of it, to 01:21:05.180 |
evaluate the full range of inputs that we would expect users to bring to this? 01:21:08.940 |
I mean, one of my suggestions would be: if you can, go to the leadership track -- go to Anoop's talk; it's tomorrow, I think. Yeah. 01:21:15.500 |
He's going to talk a lot more about additional best practices from what we've learned in 01:21:21.260 |
building voice agents. I would say, if you can hold on to the audio, it's helpful. 01:21:30.540 |
Obviously transcriptions, definitely, but the audio is still the thing that is 01:21:36.380 |
the most powerful, especially for speech-to-speech models, where you have 01:21:40.380 |
the model act on the speech, not on the text, right? And this is one of the few things where the 01:21:48.620 |
chained approach obviously makes some of this much more approachable, because you have the text 01:21:58.380 |
anyways, and you can just store the text and rerun it -- that makes that part of evals a bit easier. 01:22:03.580 |
That makes sense and then also for those of us who might be thinking about launching a new voice agent 01:22:10.060 |
How would you suggest evaluating it before we get to that stage that we'd have customer interactions to work with? 01:22:16.620 |
Um, I would start with -- and this goes back to 01:22:27.180 |
the earlier point -- human review is an excellent solution for this, right? So have 01:22:37.740 |
a system -- one of the big things with companies like Lemonade and such is that 01:22:41.980 |
they're able to go through all of these calls and get an idea, but they also have their own predetermined set of 01:22:48.780 |
examples that they might want to test as they're developing the agent. 01:22:52.860 |
So that's a great first way. I would also clearly scope the problem you're trying to solve. 01:22:58.060 |
If you're trying to boil the ocean, it makes a lot of this significantly harder, as opposed to 01:23:04.700 |
scoping well what the agent should and shouldn't be able to do. 01:23:11.820 |
Getting on the same path as our friend over there -- 01:23:15.260 |
actually, when I'm testing my agents -- my text agents, conversational -- I use Promptfoo. 01:23:21.900 |
It's a platform for doing all the proper testing. Could I put another agent 01:23:27.980 |
to talk with this agent, to do all the evaluation? 01:23:39.420 |
So I'd put another voice agent talking to that agent to try to execute all the prompts, and then I could get the transcription. 01:23:52.060 |
I mean, I know we have use cases where customers also use our models to do that -- 01:24:00.780 |
for, like, training use cases or other things, for example. 01:24:05.180 |
So it should work out, but I don't know if anyone uses that kind of approach -- oh, Lemonade does. 01:24:11.420 |
Oh, cool. So the second picture is exactly that. Awesome. Thank you. You're welcome. 01:24:27.500 |
uh, do you have something around wake word detection on the real-time api roadmap or 01:24:32.300 |
patterns for wake words? Oh, reports? Um -- no, wake words. So that'd be, like, activating the -- 01:24:38.460 |
No, we don't have any wake words built in or anything 01:24:43.900 |
No patterns either? Like, any patterns to avoid costs? 01:24:49.900 |
You could basically -- what you can do is you can turn off our 01:24:56.220 |
voice activity detection and then build your own, 01:24:59.100 |
and then basically use that. So you could use a 01:25:05.820 |
voice activity detection model that has wake words in it, and then 01:25:11.260 |
do it that way and basically commit all of that audio to our API, and then send a commit event. 01:25:21.660 |
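A rough sketch of that pattern with the Realtime API's client events -- turn off server VAD, gate the audio on your own wake-word detector, and commit manually; sendClientEvent and the wake-word detection itself are stand-ins for your own plumbing:

```typescript
// Disable the API's built-in turn detection so nothing is committed automatically.
function disableServerVad(sendClientEvent: (event: object) => void) {
  sendClientEvent({ type: 'session.update', session: { turn_detection: null } });
}

// Only forward microphone audio once the local wake word has fired.
function onAudioChunk(
  sendClientEvent: (event: object) => void,
  base64Audio: string,
  wakeWordActive: boolean,
) {
  if (!wakeWordActive) return;
  sendClientEvent({ type: 'input_audio_buffer.append', audio: base64Audio });
}

// When your own VAD decides the utterance is over, commit and ask for a response.
function endOfUtterance(sendClientEvent: (event: object) => void) {
  sendClientEvent({ type: 'input_audio_buffer.commit' });
  sendClientEvent({ type: 'response.create' });
}
```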
Awesome. Thank you so much for taking the time and spending the afternoon with me