Building voice agents with OpenAI — Dominik Kundel, OpenAI

Chapters
0:00 Timestamps
0:16 Introduction to voice agents.
1:28 Overview of the OpenAI Agents SDK for TypeScript.
3:27 The case for why voice agents are important.
4:21 A look at different architectures for voice agents.
60:16 Best practices for building voice agents.
77:31 A hands-on guide to building a voice agent.
00:00:36.880 |
So head over to that QR code or to that starter repository 00:00:41.920 |
That might take a while with the internet right now, 00:00:46.220 |
You have like 15 minutes of me rambling about stuff 00:00:51.560 |
So I said we're going to talk about voice agents. 00:00:55.320 |
I want to first put everyone on the same page 00:00:57.160 |
because I know we all have different definitions of agents 00:01:00.840 |
and there's going to be a lot of definitions flying around 00:01:06.160 |
we're talking about systems that are going to accomplish tasks 00:01:11.140 |
And most importantly, they're going to be essentially 00:01:16.140 |
with some set of instructions that then has access to tools 00:01:23.280 |
And then all of that is encapsulated into a runtime 00:01:27.900 |
And that's an important definition because today we launched 00:01:35.940 |
today we basically released the TypeScript equivalent. 00:01:40.700 |
And so we're going to use that, and it maps those exact patterns. 00:01:46.720 |
it's basically an SDK that provides you with an abstraction 00:01:51.700 |
based on the best practices that we learned at OpenAI to build agents. 00:01:57.220 |
And it comes with a couple of different base foundational features, 00:02:00.880 |
including things like handoffs, guardrails, streaming input and output, 00:02:04.620 |
tools, MCP support, built-in tracing so you can actually see what your agents did 00:02:11.940 |
And then additionally to those features that are coming from the Python SDK, 00:02:15.600 |
the SDK we launched today in TypeScript also includes human-in-the-loop support 00:02:20.060 |
with resumability so that if you need to wait for human approval for a while, 00:02:20.060 |
And most importantly, native voice agent support. 00:02:29.180 |
What that means in practice is you can use those same primitives 00:02:35.180 |
and you can build voice agents with them that handle handoffs, 00:02:37.840 |
have output guardrails to make sure that the agent is not saying things 00:02:41.440 |
it's not supposed to, tool calling, context management, 00:02:44.780 |
meaning keeping track of the conversation history 00:02:50.280 |
and built-in tracing support so that you can actually replay conversations, 00:02:55.200 |
listen to the audio of the user and properly debug what happened, 00:03:01.000 |
If you've tried to build interruptions, you might know how hard this is. 00:03:07.980 |
Both WebRTC and WebSocket support, meaning it can actually -- 00:03:11.780 |
communicate both on the server, for things like Twilio -- 00:03:16.940 |
like phone call voice agents -- or directly in the client in the browser. 00:03:22.520 |
That's what we're going to use today using WebRTC. 00:03:26.700 |
But first, why would we be interested in voice agents in the first place? 00:03:33.680 |
One of the things that I'm most excited about is it makes technology much more accessible to people. 00:03:38.420 |
There's something magical about being able to talk to a voice agent and just have it -- 00:03:50.100 |
I can convey information much faster, but also it can contain a lot of information 00:03:55.340 |
through the type of tone and voice that I'm using, the emotions. 00:04:00.160 |
So it's much more information dense than sort of just basic text is. 00:04:05.480 |
One of the cool things is also it can act as like an API to the real world. 00:04:09.020 |
You can have a voice agent go and like call a business for you and like have a conversation 00:04:14.120 |
with them where maybe there isn't an API for that business. 00:04:20.760 |
And so when we talk about building voice agents, there's essentially two types of architectures 00:04:28.100 |
The first one is based on your traditional text-based agent and just sort of wrapping it into a chained 00:04:34.200 |
approach where we have a speech-to-text model that is taking the audio and then turning it into text 00:04:40.480 |
so that we can run our basic text-based agent on it. 00:04:43.780 |
And then we take that and we run that agent, take the text output, and run it through a text-to-speech model 00:04:56.500 |
One of the most common reasons to choose this approach is that it's much easier to get started with if you already have an existing text-based agent. 00:05:05.940 |
You can take that, wrap some audio around it, and you have something that you can interact with. 00:05:11.580 |
But the other aspect is that you have full access to any model. 00:05:16.380 |
Text is the main modality that any LLM has, and so you can use really any of the cutting-edge models. 00:05:23.860 |
It also gives you much more control and visibility of what the model did by being able to actually look into the exact text that went in 00:05:35.100 |
Turn detection is one of the big ones where you need to now take into consideration what did the user hear 00:05:42.740 |
by the time that they interrupted the voice agent, then translate that part back into text. 00:05:49.820 |
Make sure that your transcript is appropriately adapted so that the model doesn't think it told the user something that it didn't. 00:05:57.820 |
Chaining all of these models together adds latency on every possible level, and so that's another big challenge. 00:06:05.660 |
And then you're losing some of that audio context, right? 00:06:08.540 |
You're transcribing the audio, and if you've ever tried to convey a complicated topic over a text, 00:06:14.940 |
you know it's a bit harder than dealing with the same thing using your own voice. 00:06:20.380 |
So, an alternative to that chained approach is a speech-to-speech approach where we have a model that has been trained on audio, 00:06:29.980 |
and then takes that audio to directly interact on the conversation and make tool calls, meaning there's no transcribing in the process. 00:06:40.300 |
The model can just natively deal with that audio, and that translates into much lower latency because we're now skipping those speech-to-text, text-to-speech processes. 00:06:50.300 |
We can also now have much more contextual understanding of the audio, including things like tone and voice. 00:06:58.060 |
And all of that leads to a much more natural, fluid level of conversation. 00:07:04.860 |
One of the most common ones is reusing your existing capabilities. 00:07:08.460 |
Everything is built around text, so if you already have some of those existing capabilities or a very specialized agent for a certain task, 00:07:18.300 |
Also, dealing with complex states and complex decision-making. 00:07:22.140 |
It's a bit harder with these models, since they've been really focused on improving on the audio conversational tone, 00:07:29.020 |
less so on being very complex decision-makers. 00:07:33.740 |
But there is a solution that we can get around with this. 00:07:37.180 |
Again, taking inspiration from what we do with text-based agents, we can actually create a delegation approach using tools, 00:07:45.820 |
where we have a frontline agent that is talking continuously to the user, and then that one uses tool calls to interact with much smarter reasoning models like o4-mini or o3. 00:07:57.020 |
Actually, let me, at this point, give you a quick demo and see how the internet goes here. 00:08:04.860 |
So I have a real-time agent here that we built with the Agent SDK. 00:08:08.700 |
It's going to be very similar to what you're going to build later on. 00:08:12.620 |
But when I start talking to this, hello there. 00:08:19.260 |
How is the AI Engineer World's Fair going? 00:08:23.260 |
So we can now give it, like, give a task to, like, call tools that I gave it, like, "Hey, what's the weather today?" 00:08:31.900 |
"Let me check the weather for you. One moment, please. Transferring you to the weather expert now. 00:08:40.540 |
Actually, I can directly help you with the weather information. Could you please specify the location you're interested in?" 00:08:46.540 |
Oh, yeah. What's the weather in San Francisco? 00:08:50.540 |
So you can see here it's actually dealing with the interruption. 00:08:53.900 |
Enjoy the bright and pleasant day. Is there anything else I can assist you with? 00:09:01.980 |
You're welcome. If you need anything else, feel free to ask. Have a great day. 00:09:06.940 |
And so in a similar way, we can actually trigger the more complicated back-end agents as well. 00:09:11.900 |
So I have a tool for this to handle refunds that will call out to o4-mini and evaluate the refunds. 00:09:18.060 |
So hey there. I have one more thing. So I recently ordered this skateboard that I tried to use, 00:09:25.180 |
and it seems like I'm really bad at skateboarding. So I want to return it. It is slightly a scratch, though. 00:09:31.180 |
I'm here to assist, but it sounds like you need customer service for that. I recommend contacting 00:09:40.300 |
the company where you bought the skateboard. They can provide you with the return. 00:09:44.860 |
Looks like I didn't add the tool. Maybe I did. Oh, I didn't ask for a refund. 00:09:51.740 |
Let's try this once more. Hey there. I bought a skateboard recently that I tried, 00:09:59.260 |
and apparently I'm really bad at using it. So I wanted to return it. It is slightly scratched, 00:10:04.220 |
so can you give me a refund? Hello there. How is the AI? 00:10:13.340 |
The joys of the joys of internet. Hey, I recently ordered a skateboard from you, and it 00:10:22.540 |
failed. Like, I can't use it. I'm struggling to use it. It's slightly scratched. Can you give me a refund, please? 00:10:30.460 |
Hello there. How is the AI? I'm going to assess your request for a refund. 00:10:38.940 |
There we go. Let's get started. It is slightly struggling with this, like, 00:10:41.820 |
weird echo that we're having here. The skateboard arrived damaged, and you're eligible for a full refund. 00:10:48.380 |
We'll process that for you. All right. But you can see here that it was able to call that more advanced tool 00:10:56.300 |
and actually process that request. And one of the nice things is that, like, while time to first token 00:11:04.380 |
is often a really important thing, the longer a conversation goes, like, your model is always going 00:11:09.980 |
to be faster than the audio that has to be read out. And so this is, like, a really helpful thing where, 00:11:15.900 |
by the time that the model was able to say, like, hey, I'm going to check on this for you, it already had 00:11:22.540 |
completed that LLM call to the o4-mini model to get the response there. All right. 00:11:29.660 |
Let me -- oh, one more thing. Since we talked about traces, one of the nice things now is we can 00:11:37.340 |
actually go back here into our traces UI. And with this launch today, you'll be able to actually look for any 00:11:44.380 |
of your real-time API cases, look at all the audio that it dealt with and all the tool calls. So we can 00:11:50.140 |
actually see here that the tool call was triggered, what the input was, the output. We can listen to 00:11:56.620 |
some of the audio again to understand what happened. And then because both this and the back-end agent 00:12:05.580 |
use the Agents SDK, we can go into the other agent as well, which was the o4-mini one, which we can see 00:12:11.660 |
here. And we can see that it received the context of the full past conversation, the full transcript, 00:12:17.180 |
as well as additional information about the request, and then generated the response here. So this allows 00:12:24.380 |
us to then get a full, complete picture of, like, what happened both in the front-end and the back-end. 00:12:28.620 |
Let's jump back into the slides and cover a couple of more things before we get coding. 00:12:34.940 |
And that's about best practices. So I would group the best practices of, like, 00:12:41.500 |
building a voice agent into three main things to keep in mind. The first one is to start with a small 00:12:47.580 |
and clear goal. This is super important because measuring the performance of a text-based agent -- you 00:12:53.740 |
will hear a lot about evals at this conference -- is already hard enough. But with voice agents, 00:13:00.220 |
it's going to be even harder. So you want to make sure that you're very focused on, like, 00:13:03.580 |
what is the first problem you want to solve and keep it focused on that and give it a 00:13:08.540 |
like, limited number of tools so that you're fully centered on this. The Agents SDK makes this really 00:13:13.500 |
easy because you can then later on add additional tools to additional agents and deal with, like, 00:13:19.180 |
handoffs between them. But this way you can kind of really stay focused and make sure that one of your 00:13:25.500 |
use cases is great and then hand off other ones to human agents, for example. The second one is 00:13:32.540 |
what I elaborated on, which is building evals and guardrails very early on so that you can feel both 00:13:38.700 |
confident in what you're building but also confident in that it's actually working so that you can then 00:13:46.060 |
continue to iterate on it and know when it's time for you to, like, grow the complexity of your voice agent. 00:13:52.700 |
As of today, you can use the traces dashboard for that. But alternatively, some of our customers 00:13:59.980 |
have even built their own dashboards, like Lemonade, to really get an end-to-end idea of the customer 00:14:05.420 |
experience and then even replay some of these conversations with their agent as they're iterating on 00:14:10.620 |
it. The other thing that I'm personally super excited about with these models is both our speech-to-speech model 00:14:17.900 |
and our text-to-speech model are generative models, meaning you can prompt them the same way that you 00:14:22.780 |
can prompt an LLM around tone and voice, and you can give it emotions, roles, personality. We built this 00:14:31.580 |
little microsite called openai.fm. It's a really fun website to play around with where we have a lot of 00:14:37.340 |
examples of different personalities and how that style of prompt can then change what is being read out by our 00:14:44.940 |
text-to-speech model. And so that's a great way for you to not just limit the experience 00:14:52.780 |
of your model or, like, the personality of your model to the voice that you picked, but to also shape it with the prompt 00:15:00.300 |
and instructions that you're giving it. One second, there was a question there. Would you mind using the mic that is 00:15:05.020 |
right behind you just so that it's on the recording? Hello, sir. So my question regards the previous 00:15:14.380 |
slides on Lemonade. So you're displaying how they have this dashboard where they can show all of this. 00:15:21.500 |
Is this a dashboard that OpenAI provides and Lemonade just integrates as, like, an iframe or something? 00:15:27.900 |
No. So in this case, they built their own solution for it. Okay. And does OpenAI then provide all the 00:15:34.540 |
JSON or the data structure that we can just plug into the... So the way the real-time API under the hood 00:15:40.460 |
works is that you get all the audio data and you can do whatever you want with that, basically. You're 00:15:45.820 |
getting all the necessary audio events so you can use those data structures. So we're not storing them by 00:15:50.540 |
default. You can use the Traces dashboard. We don't have an API for it yet, but you can use the Traces 00:15:57.580 |
dashboard to get a basic look of that, but it's not iframeable. But you mentioned it's only audio data. 00:16:06.220 |
This shows not just audio, but also the transcription and all of that as well, right? So the Traces dashboard, 00:16:11.820 |
if we go back to it, does show all of the transcripts and stuff as well, as long as you have 00:16:20.860 |
transcription turned on, which I don't seem to have turned on for this particular one. But it should, 00:16:29.340 |
like, you can turn on transcription and you should be able to see the transcripts as well. 00:16:37.980 |
All right. Let's go back to this. The other part with it is, as I said, you can prompt both the 00:16:47.100 |
personality. You can also be very descriptive with the conversation flows. One of our colleagues found 00:16:53.340 |
that giving it conversation states in this JSON structure is a great way to help the model 00:17:00.060 |
think through sort of what processes and what steps it should go through, the same way that you would give a 00:17:05.340 |
human agent a script to operate on. If you're struggling to write those scripts, though, 00:17:11.260 |
we also have a custom GPT that you can use to access that. And I'll share all of those links and a copy of 00:17:16.380 |
the slide deck later on in the Slack channel. So if you're in that, you should be able to access those. 00:17:21.660 |
But with that, that's primarily what I wanted to talk through from a slides perspective. So from here on, 00:17:32.460 |
what I want to do is build with you a voice agent. We'll see how that goes with the internet. 00:17:37.500 |
Also, if you have headphones, now is a great time to bring them out. It's going to be really weird 00:17:42.700 |
when we're all going to talk to our own agent. But we're going to try this and see how that goes. 00:17:48.380 |
So if you came in later, please scan the QR code. Go to that GitHub repository and set that up. Follow the install 00:17:57.340 |
instructions. There's no code in it yet other than, like, a boilerplate Next.js app and an empty, 00:18:03.660 |
like, package.json that installs just the dependencies that we needed so that we are not all trying to run 00:18:10.780 |
npm install at the same time. But what I want to do is build a first agent. So if you want, you can just 00:18:20.540 |
straight up copy the code that is on here. But I'm going to actually go and type it along with you all so 00:18:29.420 |
that you get a feeling for what's happening. And we have a good idea of timing. So if you want to take a 00:18:36.300 |
picture now, just go ahead. Do that. And otherwise, I'm going to switch over to my code editor and we're 00:18:44.860 |
So if you're running into trouble, the Slack is a great way to post questions that are technical 00:18:52.380 |
questions. And Anoop, who's over there, is going to try to help you. Alternatively, raise your hand, 00:18:59.340 |
but it's a bit easier if you're just slacking the messages there and we can kind of multi-thread the 00:19:05.500 |
problem. All right. Let's go and build an agent. So if you clone the project, you should see an 00:19:14.460 |
index.ts file. Go and open that and you should be able to import the Agent class from the @openai/agents 00:19:25.500 |
package. That's what we're going to use to create the first agent. Yeah? Oh, yeah. Good call. 00:19:43.900 |
Is that better? Cool. That seems a bit -- seems worse on my side than yours. But I think as long as you 00:19:53.020 |
all can read that, I'll be fine. All right. So what I want you to do is go and import an agent. And we're 00:19:59.180 |
going to define our first agent. And as I mentioned, primarily, an agent has, like, a few centerpieces. 00:20:07.100 |
The first one being a set of instructions on what to do. So we can give it instructions. I'm going to say, 00:20:12.940 |
you're a helpful assistant. It's sort of the most boilerplate thing you can do. We do need 00:20:17.420 |
to also give it a name. And that's so that we can actually keep track of them in our traces dashboard. 00:20:22.780 |
I'm going to say my agent here. This can be anything that helps you identify it. 00:20:28.220 |
And then we need to actually execute this agent. So we can import a run function here. 00:20:36.220 |
And then we can await the run here. I'm going to run this agent with just hello, how are you? And then 00:20:47.900 |
log out the results. And with the results, we get a lot of different information. Because essentially, 00:20:54.540 |
when we run an agent, it's going to do a lot of different tasks, from executing all necessary tool 00:21:00.300 |
calls, if there are any, to validating output guardrails, etc. But one of the most common things 00:21:05.740 |
that you just want to log out is the final output. That's whatever the last agent in an execution said. 00:21:13.180 |
So in this case, it's going to be a set of text. And then you should be able to run npm start, 00:21:19.820 |
npm run start 01. And that should execute it. And then you should see something like this, 00:21:33.340 |
depending on what your model decides to generate. And by default, this is going to run GPT-4.1 00:21:39.500 |
as the model. But if you want to experiment with this, you can set the model property here. 00:21:47.500 |
o4-mini, for example, and then rerun the same thing. So this is the most basic agent that you can build. 00:21:56.620 |
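To make this step concrete, here is a minimal sketch of that first text agent, assuming the starter's index.ts and the @openai/agents package; the commented-out model override is just illustrative:

```ts
import { Agent, run } from '@openai/agents';

// The most basic agent: a name (used to identify it in the traces dashboard)
// and a set of instructions.
const agent = new Agent({
  name: 'My agent',
  instructions: 'You are a helpful assistant.',
  // model: 'o4-mini', // optional: override the default model to experiment
});

async function main() {
  // run() drives the whole agent loop (tool calls, guardrails, handoffs, ...)
  // and resolves once the run is finished.
  const result = await run(agent, 'Hello, how are you?');

  // finalOutput is whatever the last agent in the run produced.
  console.log(result.finalOutput);
}

main();
```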
But one of the things that really makes something an agent is if it can execute tools. 00:22:01.900 |
So we can import a tool here. There we go. And we can define a get weather tool. 00:22:10.540 |
One of the things here is you have to specify what arguments the model is going to receive. And one 00:22:22.140 |
of the ways that you can do this is through a library called Zod. If you've never heard of it, 00:22:26.380 |
it's essentially a way to define schemas. And what we'll do is we'll both 00:22:31.660 |
use that Zod schema to inform the model on what the parameters for this function call are. 00:22:37.020 |
But we're also going to use it to then validate the 00:22:39.900 |
actual arguments that the model tried to pass in and whether they fit that schema. So we get 00:22:45.340 |
full type safety here. If you're a TypeScript developer and you care about that. 00:22:49.420 |
So in this case, we have a get weather tool. And then we can give that tool 00:22:55.340 |
to the agent, and we can change this to 'What is the weather in Tokyo?', which is what Cursor wants to check. 00:23:04.620 |
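A sketch of that get weather tool; tool() plus a Zod schema is the pattern from the SDK, while the tool name, description, and the hard-coded "sunny" result are just illustrative:

```ts
import { Agent, run, tool } from '@openai/agents';
import { z } from 'zod';

// The Zod schema both tells the model what arguments the tool expects and
// validates the arguments the model actually produced before execute() runs.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => {
    // A real implementation would call a weather API; this is a stub.
    return `The weather in ${city} is sunny.`;
  },
});

const agent = new Agent({
  name: 'My agent',
  instructions: 'You are a helpful assistant.',
  tools: [getWeather],
});

const result = await run(agent, 'What is the weather in Tokyo?');
console.log(result.finalOutput);
```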
let me move this slightly. We can see it's going to take a bit longer now. And that's because it ran some 00:23:11.900 |
tools. And now it's telling me the weather in Tokyo is sunny. And if you're wondering, well, did it actually run a tool? 00:23:27.580 |
and look at the trace. We have a my agent here. 00:23:35.180 |
And then there we can see it ran: it tried to call the tool, executed the tool and got 'the weather in Tokyo is sunny' back, 00:23:45.420 |
and then took the response to generate the final response. 00:23:50.380 |
So the traces dashboard is a great way for you to see what actually happened behind the scenes. 00:23:54.460 |
How are we feeling? Can I get a quick temperature check? Are people able to follow along? I see Zeke is 00:24:02.780 |
giving a thumbs up there. So this is a text based agent. I wanted to show you this just to get a bit 00:24:10.460 |
familiar with the overall agents SDK so that we can jump into building voice agents. 00:24:18.060 |
The first thing we need to understand about a voice agent is the slight differences between 00:24:25.420 |
a voice agent and what we call a real-time agent. Essentially, a real-time agent is just a 00:24:31.260 |
specialized version of an agent configuration. There's just a few fields you can't pass in. 00:24:36.540 |
But they can be used in what's called a real-time session. Because with voice agents, 00:24:41.180 |
there's a lot more things to deal with than just executing tools in a loop. 00:24:45.900 |
One of the most important things is you need to deal with both the audio that's coming in, 00:24:49.500 |
process that, and then run the model with that, and then deal with the audio that's coming out. 00:24:57.340 |
But you also need to think about things like guardrails, handoffs, other lifecycle things. 00:25:02.540 |
And so the real-time session is really dealing with all of that. 00:25:06.140 |
So let me show you how that works. For this, what we're going to do is we're going to go in the same 00:25:13.020 |
project. There's a 02. And it has a page TSX in there. This is a Next.js app that really, 00:25:23.900 |
I just gutted to have the bare minimum in there. But this is a great way for us to just build both the 00:25:30.060 |
front-end and the back-end part of the voice experience. Because this voice agent that we're 00:25:37.420 |
going to build is going to run in the browser. In order to make sure we're not leaking your API 00:25:41.980 |
credentials, one of the important things is you need to use an ephemeral key. That is a key that is 00:25:47.820 |
short-lived and is going to be generated by your server and handed off to your client so that they can 00:25:53.100 |
use that to interact with the real-time API over a protocol called WebRTC. For that, you should see 00:26:01.100 |
a token.ts file in your repository that just calls out to the real-time API to generate a session and 00:26:09.180 |
then return a client secret, which is that ephemeral key that we can then use to authenticate with the 00:26:14.060 |
SDK. You do not have to do this if you're building a real-time agent that is running on your server. For 00:26:20.540 |
example, in the case of a Twilio app or something else where you can just directly use the 00:26:27.740 |
OpenAI API key. But if you're running anything in the browser, then you actually need to generate 00:26:34.780 |
this client key just so that you're not, you know, giving your API key to the world. 00:26:41.900 |
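Roughly what that token.ts server action could look like. This is a hedged sketch: the /v1/realtime/sessions endpoint and the client_secret.value field follow my understanding of the Realtime API's ephemeral key flow, and the model string is illustrative, so check it against the starter repo and the docs:

```ts
'use server';

// Generates a short-lived client secret on the server so the browser
// never sees the real OpenAI API key.
export async function getToken(): Promise<string> {
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview-2025-06-03', // illustrative model name
    }),
  });

  const session = await response.json();
  // The ephemeral key is under client_secret.value (assumed response shape).
  return session.client_secret.value;
}
```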
So with that in here, we can actually go and build our first real-time agent. So similar to previously, we're going to 00:26:48.460 |
import an agent class here. But in this case, it's going to be a real-time agent. 00:26:53.820 |
We're going to import it from the real-time package, 00:26:59.180 |
which is just a sub path in the same package. So you don't need to install a different package here. 00:27:05.260 |
But now we can define a real-time agent that works the same way. We have a name. We give it instructions. 00:27:13.980 |
Just sort of going with a default suggestion here. And now we actually need to connect that agent to a 00:27:20.220 |
real-time session. So I have a connect button here for running this example. Let me start it up here 00:27:26.540 |
with npm run start 02. That command should be in your readme as well. 00:27:33.740 |
It's going to start a development server. And we can go over here, reload this. And you can actually see 00:27:39.820 |
it just has like a little connect button that right now doesn't do anything. 00:27:46.460 |
So let's connect that up. I don't need this anymore. Let me just move this to the side. 00:27:53.180 |
So in this on connect function that gets triggered whenever we press the button, 00:28:00.860 |
we want to deal with that connected state. So what we're going to do here is we're first going to fetch 00:28:08.140 |
that token. And this code, what this basically does is it's going to import that server action, 00:28:14.540 |
which is the next JS concept that just makes sure that like this code is going to run on your backend. 00:28:19.980 |
If you're using a different framework, you should be able to just go and fetch this key from your 00:28:25.420 |
backend server. And then once we have that token, we can go and create a new real-time session. 00:28:33.580 |
So what we're doing here is we're going to give it the first agent that should start the conversation up. 00:28:39.020 |
I'm going to specify the latest model that we released today along with the agent's SDK. 00:28:45.420 |
If you've used the real-time API before, this model is an improvement, especially around tool 00:28:50.860 |
calling. It's much better on that front. We have a couple of different customer stories on our Twitter, 00:28:56.780 |
if you want to check that out. And then I'm going to give it -- not there. I don't know why my 00:29:03.420 |
cursor insists on that. The last step that we need to do is we need to connect to that session. 00:29:08.540 |
So this is where we're going to give it that API key so that we can connect to the real-time session 00:29:16.940 |
under the hood. Just so that it's easier for us to deal with all of this, I'm also going to close 00:29:22.940 |
the session. But I've got one thing here that is an oddity of React. We do not want to generate that 00:29:31.580 |
session on every re-render. So I'm going to create what is 00:29:38.060 |
called a ref here. Again, if you're new to React, this basically is just a variable that's going to 00:29:44.620 |
persist through re-renders. So we need to slightly change this here. We're going to assign that to 00:29:49.740 |
session.current so we can maintain that. And then that also allows us to say if there is a session.current 00:29:57.420 |
set, we want to actually close that connection when we press the disconnect button. That just 00:30:02.460 |
makes sure that we're disconnected from the audio again. So I'm going to leave that on the screen for 00:30:09.660 |
a second, and then we can test this out. But if you already typed this, go into your browser, refresh, 00:30:15.260 |
press connect, and you should be able to talk to your agent. 00:30:18.140 |
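Put together, the browser side of this step looks roughly like the sketch below. The RealtimeAgent, RealtimeSession, connect, and close calls are the SDK pieces described above; the getToken import path, the component shape, and the model string are assumptions that mirror the walkthrough:

```tsx
'use client';

import { useRef } from 'react';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';
import { getToken } from './server/token'; // hypothetical path to the server action

const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
});

export default function App() {
  // A ref so the session survives re-renders without being recreated.
  const session = useRef<RealtimeSession | null>(null);

  async function onConnect() {
    if (session.current) {
      // Already connected: close the session to disconnect the audio again.
      session.current.close();
      session.current = null;
      return;
    }
    const apiKey = await getToken(); // ephemeral client secret from the server
    session.current = new RealtimeSession(agent, {
      model: 'gpt-4o-realtime-preview-2025-06-03', // illustrative
    });
    // In the browser this sets up WebRTC plus microphone and speakers automatically.
    await session.current.connect({ apiKey });
  }

  return <button onClick={onConnect}>Connect</button>;
}
```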
All right, let's try mine. Let me move this to the other side so it's not blocking your code. 00:30:37.900 |
Hello? Hi there. How can I assist these days? All right. So you can see it. It's just a few lines 00:30:46.940 |
of code. We didn't have to deal with things like figuring out how to set up the microphone, how to 00:30:51.420 |
set up the speakers. By default, if it's running in the browser, it will deal with all of that 00:30:56.380 |
automatically. If you do want to pass in your own microphone source or other things like that, 00:31:00.060 |
you can do that as well. If this is running on a server, you have both a 00:31:08.380 |
send audio function that allows you to send an audio buffer in, or you can listen to the audio 00:31:18.380 |
event, which is going to emit all of the audio buffers that are coming back from the model so that 00:31:25.420 |
you can pass it to whatever your source is. So that's our first basic agent. Any questions so far? 00:31:35.020 |
Please update the repo. Can you send that code to the repo? 00:31:42.780 |
You want me to send the code to the repo? Can you push it? Can you push it? 00:32:11.180 |
let's go and actually give it a tool. So this is really where the benefit of the 00:32:18.220 |
agents SDK comes in. We can actually use that same tool definition that we did earlier. So I'm just 00:32:23.980 |
going to follow the autocomplete here. We should be able to just give that tool now to our agent 00:32:31.820 |
and save. I need to import Zod again to do that schema validation. This is especially important on the 00:32:41.660 |
real-time side because the real-time model currently does not support strict mode. So the JSON 00:32:48.060 |
might not fully comply with your schema unless you're giving us a Zod schema and we'll go 00:32:57.740 |
and validate that this actually fits that schema. So that makes your code a bit easier. 00:33:03.020 |
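A short sketch of reusing that same tool definition on the real-time agent (names and the tool body as before, purely illustrative):

```ts
import { tool } from '@openai/agents';
import { RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

// Same shape as the text agent's tool. The Zod schema matters even more here:
// the real-time model doesn't support strict mode, so the SDK validates the
// arguments against the schema before execute() ever runs.
const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
  tools: [getWeather],
});
```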
So with that we can go back. Hey, what's the weather in San Francisco? 00:33:14.700 |
We can disconnect it here. Also, this does now deal with interruption. So, 00:33:21.900 |
what's the weather in San Francisco is the weather in San Francisco. 00:33:29.580 |
So the weather in San Francisco is sunny today. 00:33:31.660 |
Normally, that's enough to deal with the context. 00:33:52.780 |
But it is super crucial to have that interruption timing. 00:33:56.140 |
So that like your model doesn't think it read out like the full customer policy. 00:34:00.300 |
But the customer interrupted it halfway through, for example. 00:34:05.660 |
Question. You don't have to manage all the events to actually do that anymore? 00:34:12.060 |
No. So the real-time session will handle all of those events. 00:34:16.940 |
What we can do is listen to the transport event. 00:34:34.860 |
This will log out all of the events that are happening under the hood. 00:34:40.460 |
So if we open the dev tools here and rerun this. 00:34:52.220 |
So you can see all of the events that normally you would have to deal with are being dealt with. 00:34:59.980 |
So you can both read them, but you can also send events yourself. 00:35:05.580 |
But we continue to pass them on to you if you want to do your own logic on top of that. 00:35:10.700 |
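A sketch of tapping into those raw events; 'transport_event' is my understanding of the event name the session emits, so treat it as an assumption:

```ts
// Log every low-level Realtime API event the session is handling under the hood.
// The SDK still manages interruptions, audio, and tool calls; this is just a tap
// for your own debugging or custom logic.
session.current?.on('transport_event', (event) => {
  console.log(event.type, event);
});
```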
I'm going to push that code for you so you can pull it. 00:35:32.220 |
Since we already have this commented out code, the other part of this that typically is a request 00:35:37.340 |
that you want to deal with is I want to show like the transcript. 00:35:41.420 |
I want to see what sort of is being transcribed. 00:35:44.140 |
And the important thing here is I'm using the word transcribe because even though the speech-to-speech model 00:35:49.180 |
is dealing with the audio directly and there is no transcription step in between, by default, 00:35:55.740 |
we're going to transcribe all of the conversation at the same time. 00:36:02.620 |
If you're using the API directly, you have to actually turn it on. 00:36:06.380 |
In the agent SDK, it's turned on by default because it's such a common request. 00:36:11.420 |
And it enables us to do a couple of additional features that we'll cover later on. 00:36:16.620 |
But this is going to give us that whole history every time. 00:36:24.060 |
Or rather, I'm going to - there we go - import that. 00:36:34.140 |
And then because it's React, we can create a list here. 00:36:42.540 |
I need to filter because it has both tool calls and messages. 00:36:46.620 |
And I only want to show the messages for this. 00:36:48.380 |
So I should be able to - why does it want that? 00:37:14.380 |
So you're automatically getting that conversation. 00:37:17.420 |
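A sketch of rendering that transcript in React, assuming a 'history_updated' event whose items carry a type field that distinguishes messages from tool calls; the exact item shape (role, content) is an assumption to verify against the docs:

```tsx
'use client';

import { useEffect, useState } from 'react';
import type { RealtimeSession } from '@openai/agents/realtime';

export function Transcript({ session }: { session: RealtimeSession }) {
  const [history, setHistory] = useState<any[]>([]);

  useEffect(() => {
    // history_updated fires with the full conversation so far on every change.
    const handler = (items: any[]) => setHistory(items);
    session.on('history_updated', handler);
    return () => session.off('history_updated', handler); // off() assumed to exist
  }, [session]);

  // The history contains both tool calls and messages; only render the messages.
  const messages = history.filter((item) => item.type === 'message');

  return (
    <ul>
      {messages.map((item, i) => (
        <li key={i}>
          {item.role}: {JSON.stringify(item.content)}
        </li>
      ))}
    </ul>
  );
}
```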
If you are interrupting the model, one of the things that happens is the transcript is going to 00:37:22.060 |
disappear, and that's because the model currently does not adjust that transcript. 00:37:31.420 |
And we're going to remove it from that object as well, just so that you get the most accurate 00:37:36.140 |
representation and you're not thinking that, like, the model read out a certain piece of text. 00:37:40.940 |
And again, with everything that we're doing here, we can actually go back into traces. 00:37:46.780 |
And we can see that same representation here with the weather call and everything. 00:38:02.620 |
The question was, how do you store the conversation history? 00:38:13.340 |
So basically, there is going to be a bunch of events that I logged out that are being 00:38:20.460 |
All of those are going to be sent over to the client and then stored in memory in a conversation, 00:38:28.940 |
So you can do whatever you want with that by listening to that history updated event. 00:38:33.340 |
So if you do want to store it somewhere, you can store it. 00:38:36.220 |
The other part is the traces part is automatically going to be stored on the OpenAI platform. 00:38:44.300 |
As long as you have tracing enabled. You can disable it, but by default in the Agents SDK it's enabled. 00:38:52.460 |
And then the other aspect of that is if you are a ZDR customer, so a zero-data-retention customer 00:38:58.300 |
of OpenAI, you don't have access to that traces feature. 00:39:01.660 |
The question was, how much of the, like, voice context, how much of the previous conversation 00:39:24.780 |
That's going to depend, and it's sort of, like, dealt with directly by the real-time API. 00:39:29.340 |
So, like, the real-time API, when you start that session, 00:39:32.860 |
that holds the source of truth for that whole conversation session. 00:39:38.540 |
So what you're receiving on the client side is just a copy of whatever is happening at that point. 00:39:45.180 |
It's not the source of truth of what we're going to adapt to pass into the model. 00:39:49.820 |
The question is, how does it work with, like, the inference cost and, like, whether you're 00:40:16.780 |
passing it, like, passing in that whole conversation? 00:40:23.340 |
But yes, we're actually, like, you can log the, like, we're keeping track of the usage. 00:40:30.380 |
There's an event that you can, like, log out to see your token cost. 00:40:33.260 |
So you have an idea of, like, what is being actually passed in. 00:40:37.100 |
So, like, with every, if we're going back here to this example, you can see these response done events. 00:40:45.820 |
I don't know, where is the, shouldn't it be on the response done? 00:40:55.340 |
I just do not know right now why it's not showing. 00:41:00.220 |
So you can see here, it outputs the detailed information of your token usage at any point in time. 00:41:08.380 |
So while you don't have, like, access to, like, what is exactly what was passed into the next 00:41:13.180 |
response generation, you can keep track of the cost as it's happening. 00:41:23.980 |
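One hedged way to watch that usage as it happens: listen for the Realtime API's response.done transport event and read its usage object. The field names below are my best recollection of that event's shape, not guaranteed:

```ts
session.current?.on('transport_event', (event) => {
  if (event.type === 'response.done') {
    // usage breaks token counts down further, e.g. audio vs. text tokens (assumed fields).
    console.log('usage so far:', event.response?.usage);
  }
});
```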
That might be easier than you yelling across the stage to me. 00:41:27.260 |
I see that the format that you're using is PCM 16. 00:41:31.420 |
Is there a way in which we can modify the output formats of the audio files so we can save in memory? 00:41:36.700 |
Um, yeah, there are different, different audio modes that you can use. 00:41:41.260 |
Um, including, like, for example, ULaw for, that is, like, helpful for phone calls, for example. 00:41:52.140 |
Does that, like, final assistant response roll up all the tokens from, like, all the 00:42:02.220 |
Like, the agent needs to, like, kind of reason through and then format tool calls. 00:42:06.300 |
So I'm assuming it's not just the output tokens for only the assistant response, right? 00:42:10.860 |
It, like, every tool call is a response in general as well. 00:42:17.100 |
So, like, it works the same way that, like, the responses API works, for example. 00:42:24.620 |
Because we're using this and we have, like, tool calls and tool call outputs, right? 00:42:28.300 |
And I couldn't find the, like, usage attribute on the tool call output. 00:42:32.140 |
Is it somewhere in those, like, raw events that are outputted? 00:42:42.780 |
Yeah, do you want to head over to that microphone that is right behind you? 00:42:53.100 |
Can I go back to the slides explaining the different modes of the audio agents, like, 00:43:16.780 |
When you just showed us the GPT-4o real-time, that one... 00:43:28.380 |
When we did the refund, it kind of followed this pattern, where it performs a tool call... 00:43:34.780 |
Like, the, like, real-time API agent can perform tool calls. 00:43:39.500 |
It performed a tool call to trigger a separate agent that was the refund agent that, in my case, 00:43:45.180 |
used o4-mini to execute that task and then hand that back. 00:43:53.580 |
I'm currently using, like, a regular OpenAI agent. 00:43:59.340 |
So, what will be the challenge that we face when we want to change my regular agents to real-time agents? 00:44:07.740 |
So, there's a couple of different challenges. 00:44:10.220 |
Like, one is, like, anything that you're doing around latency... 00:44:16.140 |
Like, anything you're doing around voice, latency is always king. 00:44:21.580 |
So, like, you want to figure out what are the best ways to... 00:44:27.500 |
Like, when it comes to things like tool calling, you want to find ways to do things like buying yourself some time. 00:44:35.420 |
So, you will typically see some prompting around, like, announce what you're about to do next before you're doing it. 00:44:42.700 |
And that's to do that little trick around while the previous audio is still being read out. 00:44:49.180 |
The agent can already perform the tool call and wait for the tool call to come back. 00:44:54.540 |
Because, similar to a text-based agent, the model can't do... 00:44:58.620 |
Like, can't receive additional data as, like... 00:45:01.180 |
Like, do another thing outside of, like, we can interrupt the response, but it can't finish that response, if that makes sense. 00:45:09.260 |
And so, you want to do these sort of, like, buying time. 00:45:12.700 |
The other thing is, like, if you're building a real-time agent, the longer your prompt gets, 00:45:18.780 |
at one point it increases the likelihood that it gets confused. 00:45:21.900 |
So, you want to make sure you're properly scoping those use cases and, like, through what we call handoffs, 00:45:29.500 |
where you have different agents that are more scoped to specific steps in your experience. 00:45:40.460 |
Can you speak a little bit more about memory? 00:46:05.340 |
When we go back to this demo, what you're seeing here is essentially just, like, 00:46:11.100 |
a copy of the events that we're receiving back. 00:46:14.300 |
So, this is, like, helpful as a visualization of the history. 00:46:17.180 |
That being said, the actual, like, memory in the sense of, like, an LLM agent memory is the 00:46:24.940 |
session context that is happening on the real-time API side. 00:46:28.620 |
There are events that you can use to update that. 00:46:31.100 |
We actually have an update history event that you can pass in what you want the history to be. 00:46:37.020 |
But what that does is essentially, like, fire off events to the real-time API to say, like, 00:46:42.620 |
delete this item from the history or add this new item. 00:46:48.700 |
So, like, you can, for example, like, slot messages into a specific spot if you wanted to. 00:46:56.060 |
But there's, like, no, like, advanced, like, long-term memory solution like you were alluding to. 00:47:06.780 |
Do you have tips for handling input from low-fluency users? 00:47:10.700 |
Like, say someone who's just learning a language and they have, like, multilingual input and maybe 00:47:15.740 |
broken grammar and their pronunciation is not so good? 00:47:17.980 |
I don't think I have any, like, best practices right now that I could share. 00:47:26.220 |
It can handle, like, switching languages and things like that. 00:47:31.100 |
But it might not be able to handle low fluency. 00:47:37.660 |
Yeah, we have some customers that are, like, language learning companies. 00:47:47.660 |
So, there is some that are using it that way. 00:47:49.980 |
But I don't think I have any, like, best practices that I can share. 00:47:58.540 |
Back in the code, is there a callback for the interrupt? 00:48:05.340 |
Um, there is a callback for the interrupt, but, uh, there is no, um, 00:48:18.140 |
There's no param or e that comes with it or anything like that? 00:48:23.500 |
So, what you can do is, if you're getting this, you can call -- 00:48:34.460 |
Um, so you can -- you have access in that moment. 00:48:37.420 |
The thing that we do have is, um, for tool calls specifically, 00:48:48.300 |
and that context has a, um, history parameter 00:48:55.500 |
Um, it's more documented in the -- in the documentation. 00:49:11.820 |
is you can reuse the same syntax that you're doing with text-based ones. 00:49:15.340 |
Um, it's also a good way for you to then communicate with your 00:49:22.060 |
Um, follow sort of a, um, general practice around, like, 00:49:30.060 |
keeping both the tool calls as low latency as possible. 00:49:36.700 |
Like, for example, if you know a task is going to take longer, 00:49:39.980 |
start the task, give it a task ID, and have the agent have a tool to check on the status, 00:49:47.020 |
Like, that helps getting back to it because, again, while the tool call is going on, 00:49:54.940 |
So, you want to -- you want to make sure to, like, get back to that as soon as possible. 00:49:58.940 |
Um, one of the other things that you can do is human approval -- uh, human approval. 00:50:03.660 |
I can show you that quickly. There's essentially a, uh, needs approval option 00:50:10.300 |
that, um, you can either specify as a function that will be evaluated before the tool ever gets 00:50:17.820 |
triggered. This is a great way if you have, like, a more complex logic on "I need approval for this." 00:50:23.820 |
You can also give it just straight up, "I always need approval," at which point there is a 00:50:30.300 |
another event here, um, tool approval requested, and then that gets a, um, event here, so we can 00:50:43.020 |
do things like, um, good old prompt. Um, and then we can go and approve that tool call again. 00:51:02.220 |
I don't know why the autocomplete is not working. Um, um, proof. There we go. And, uh, 00:51:29.820 |
this is where I go into the docs because I do not remember why this is autocompleting the wrong way. 00:51:40.140 |
But everything I'm showing you is in the docs. Um, so we can just -- oh, took the wrong thing, right? The first -- 00:52:04.540 |
Approval request. Thank you. It's like the classic thing when you're on stage and you can't really -- 00:52:12.620 |
there we go. So, in this case, I'm just going to always approve. But if we now go in, 00:52:24.620 |
go in, "Hey, um, can you tell me the weather in Seattle?" 00:52:29.500 |
So we can, in that case, approve it. It's always going to approve right now because I'm not actually 00:52:38.620 |
checking the status. But, um, that means you can build, like, a human in the loop approval experience. 00:52:44.220 |
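A sketch of that approval flow. needsApproval on the tool is the SDK option being described; the tool_approval_requested event signature and the approve() call are written from memory and should be checked against the docs:

```ts
import { tool } from '@openai/agents';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the current weather for a given city.',
  parameters: z.object({ city: z.string() }),
  // Can also be an async function of the arguments if only some calls need approval.
  needsApproval: true,
  execute: async ({ city }) => `The weather in ${city} is sunny.`,
});

// The session pauses the tool call and asks before anything executes.
session.current?.on('tool_approval_requested', (_context, _agent, request) => {
  // In the browser, a good old confirm() is enough for a quick demo.
  if (window.confirm('Approve this tool call?')) {
    session.current?.approve(request.approvalItem); // payload shape assumed
  }
});
```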
This is really convenient, especially if you're running it in the browser and you just want to have, 00:52:48.700 |
like, a confirmation of, like, the tool is hallucinating things before the customer 00:52:53.580 |
And does it do it directly? Can it actually say, "Are you okay if I do this?" 00:52:58.460 |
So basically, this is happening -- so the question is, does it automatically do this? Like, 00:53:08.700 |
the -- what we're doing and the reason why this is separate is the model is asking for this tool to be 00:53:15.900 |
executed. But we're intercepting this, basically, before we're ever generating or executing the response. 00:53:22.300 |
This is intentional so that, like, you don't have to deal with -- like, we want you to think through 00:53:29.340 |
why should this tool need approval as opposed to doing that somewhere halfway through your tool execution. 00:53:35.260 |
And you have to, like, deal with the consequence of rolling back every decision that you've made, 00:53:39.740 |
for example. And so, by default, if this is just set to true, it cannot get past that until the execution 00:53:47.340 |
was approved, at which point it stores it in the context that is stored locally and then bypasses 00:53:53.100 |
that security. So this is not happening on the model level. 00:54:01.660 |
So the other thing we talked about already, but I want to show it in practice, is handoffs. So a handoff 00:54:09.500 |
is essentially just a specialized tool call that resets the configuration of the agent in the session. 00:54:17.660 |
So that we can update the system instructions, we can update the tools, and make sure that we can 00:54:24.460 |
nicely scope the tasks of what we're trying to solve. So what you cannot do -- I know people are probably 00:54:33.260 |
going to ask about this -- is you can't change the voice of the agent mid-session. You could define 00:54:40.380 |
different voices on different agents, but the moment that you're, like, the first agent that starts talking, 00:54:46.460 |
that's the voice that we're going to stick with throughout the entire conversation. So that's a caveat 00:54:52.300 |
to just keep in mind. But they're still very helpful to, let's say, have a weather agent here. 00:55:05.900 |
We'll do this. And then what we can do is we can actually give it a handoff description. So if you 00:55:08.540 |
don't want to have this in your system prompt, but you just want to help the model understand when to use this, you can say, like, this agent is an 00:55:44.060 |
expert in weather. And then this one is going to have that weather tool. We're going to remove it from this one, 00:55:50.780 |
and we're going to give it a handoff instead to that other weather agent. 00:55:54.140 |
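A sketch of the handoff setup just described; handoffDescription and handoffs are the SDK fields, while the names, instructions, and the getWeather tool (defined as earlier) are illustrative:

```ts
import { RealtimeAgent } from '@openai/agents/realtime';

// A scoped agent that owns the weather tool (getWeather defined as earlier).
const weatherAgent = new RealtimeAgent({
  name: 'Weather agent',
  // Helps the main agent decide when to hand off, without bloating its own prompt.
  handoffDescription: 'This agent is an expert in weather.',
  instructions: 'You answer weather questions. Talk with a New York accent.',
  tools: [getWeather],
});

// The frontline agent no longer has the tool; it hands off instead.
const agent = new RealtimeAgent({
  name: 'Voice agent',
  instructions: 'You are a friendly assistant.',
  handoffs: [weatherAgent],
});
```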
So now if I'm going to restart this. Hey, can you tell me the weather in New York? 00:56:02.460 |
The weather in New York is sunny, so you might want to grab your sunglasses if you're heading outside. 00:56:11.340 |
Enjoy the day. All right. That's the model's best attempt at a New York accent. 00:56:18.540 |
We'll take it. But you can see there that, like, it automatically handed off from that first agent to 00:56:24.060 |
that second one and let it handle it. You can, through prompting, do things like, do you want it to 00:56:29.580 |
announce that it's about to handoff? Do you not want to do that? Sometimes it's a bit awkward if you're 00:56:34.460 |
forcing it to always do it. So, like, I would not necessarily try it, but maybe that's the type of 00:56:39.740 |
experience that you want to have. So that's handoffs. Let me do you a favor and push that code. 00:56:47.020 |
Push, maybe push. So the agent can't change the voice when passing to another one, but it can 00:56:54.780 |
change accents? Yeah, so that's a good question. The agent can't change the voice, but it can change the 00:57:04.540 |
accent. Again, this goes back to, like, the model is a generative model. So you can prompt it to have 00:57:15.340 |
different, like, pronunciations, tonality, like, voice in that sense, but it cannot change the voice model 00:57:25.340 |
that is actually being used to generate that output. 00:57:30.220 |
So maybe as a extension of that, is it, like, the whole real-time request body that can't be changed 00:57:37.340 |
or just the voice? So can I, like, create a tool that could adjust the speed if someone was saying it's 00:57:43.180 |
talking too fast, or the noise reduction? You should be able to change it. I have not tried the 00:57:50.060 |
speed parameter changing mid-session because it literally came out today. I don't know. Anoop, 00:57:57.180 |
have you tried this? No. But, like, you can, like, essentially a handoff does change the session 00:58:03.500 |
configuration. Like, if we look back at, like, the, like, one of the transcripts here, like, now that we 00:58:13.020 |
have a handoff. Let's go to this trace. So you can see here that, like, it called the transfer to 00:58:21.260 |
weather agent. But then, like, these instructions were 'talk with a New York accent'. So in this case, 00:58:32.380 |
it did change the instructions midway through the session. And the same way, like, when that handoff 00:58:38.060 |
happens, we take away the tools, we give it new tools. So you can change those tools. You could 00:58:44.300 |
have a tool to change the tool, but my recommendation would be, like, use a different agent for that. 00:58:49.420 |
But then, like, the speed control, like, you could, you should be able to send off an event, but I have 00:58:55.180 |
not tried that. Yeah, or maybe, like, like, the background. Like, basically, like, if you had 00:59:00.380 |
something and someone was, like, in a noisy environment, like, hey, you seem to be getting 00:59:05.900 |
interrupted. Could you adjust what we catch in the background and start adjusting that parameter 00:59:11.180 |
so just the voice is protected as far as possible? Yeah. So the question is, like, for example, if someone 00:59:17.900 |
is, like, in a noisy environment, like, could you have the agent detect that and then use, 00:59:23.500 |
like, adjust some of the session configuration to deal with that and just the voice is protected? 00:59:30.060 |
I don't know, honestly, which parameters are protected or not. The good thing is, like, the API will throw 00:59:36.060 |
an error if the thing didn't work. So it's a good way. It's a good thing to experiment with. 00:59:41.100 |
You could do that in Python. Hmm? With the previous ones, you could do that in Python. 00:59:47.580 |
Oh, yeah. Well, with the Python Agents SDK, we're doing the chained approach. We don't have a real-time 00:59:55.340 |
built-in yet. So... Just calling the old API in Python, you could change it. Oh, all right. Yeah, 01:00:03.500 |
yeah. Then it should work. Yeah. If you can do it in Python, like, it should just work. 01:00:09.980 |
Cool. So the other thing we talked about is this delegation part. So that's what I, like, 01:00:19.740 |
had mentioned earlier that was in the diagram. So this is essentially where you want to be able to 01:00:26.460 |
have certain complex tasks dealt with by a more intelligent model. And the way we can do that is 01:00:33.580 |
essentially just creating another agent except on the back end. And because the TypeScript SDK works both in 01:00:42.220 |
the front and back end, we can do that through -- I think I have a -- let's see if we have a file here or not. 01:00:52.220 |
We can do that using the same SDK. So I'm going to create on the -- in the server folder here, a new 01:01:00.300 |
file I'm going to call just agent. And in here, we can build our regular text-based agent. So this is 01:01:09.660 |
essentially the same code that we've done before. And we can say -- this is a -- I don't know -- 01:01:20.460 |
called the Riddler. You are excellent at creating riddles based on a target demographic 01:01:36.380 |
and topic. And we'll just give it a model of o4-mini. 01:01:42.940 |
Also, a reminder for those, if you are trying to follow along and you run into troubles, post in 01:01:53.100 |
the Slack, and Anoop can help you with that. So we have that new agent here. We're not going to give it 01:01:59.740 |
any tools or anything. And then we can export a function here 01:02:05.180 |
that we just call a run agent. And this is just going to take some input and then return that output. 01:02:17.660 |
And we can go back into our front-end code, create a new tool here, create riddle. 01:02:27.100 |
And this one, we're just going to have -- take like two parameters, the demographic and the topic, 01:02:33.740 |
and then call out the run agent function that is going to run on the server. 01:02:42.140 |
We can pass in -- actually, realize I didn't specify this. Let's do demographic and topic. And then create 01:02:50.940 |
an input here of this. The other thing you want to do when you're using server actions in 01:03:01.420 |
Next.js is put that 'use server' at the top. That makes sure that this file executes on the server. 01:03:08.460 |
And then we can pass in that demographic. And again, if you're using a different framework, 01:03:14.140 |
this is just the equivalent of a fancy fetch request. So if you want to do an HTTP request to 01:03:19.740 |
your server, if you want to maintain a WebSocket connection to your own backend, you can do all 01:03:25.100 |
of those things to talk back to other systems. So with that, we can give that to our main agent. 01:03:37.900 |
And then what you want to do in these cases is -- you can tell it like announce when you are about to do 01:04:00.780 |
a task. Don't say you are calling a tool. Things like that can be helpful to like buy itself some time. 01:04:11.820 |
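A sketch of the whole delegation setup: a backend "Riddler" agent behind a Next.js server action, and a create riddle tool on the real-time agent that calls it. The file layout, names, and the "announce before you act" instruction mirror the walkthrough but are assumptions:

```ts
// server/agent.ts
'use server';

import { Agent, run } from '@openai/agents';

// A reasoning-heavy backend agent; the voice agent delegates to it via a tool call.
const riddler = new Agent({
  name: 'Riddler',
  instructions:
    'You are excellent at creating riddles based on a target demographic and topic.',
  model: 'o4-mini',
});

export async function runAgent(input: string): Promise<string> {
  const result = await run(riddler, input);
  return String(result.finalOutput ?? '');
}
```

```ts
// page.tsx (client side): the tool the real-time agent uses to delegate.
import { tool } from '@openai/agents';
import { z } from 'zod';
import { RealtimeAgent } from '@openai/agents/realtime';
import { runAgent } from './server/agent'; // hypothetical path

const createRiddle = tool({
  name: 'create_riddle',
  description: 'Create a riddle for a given demographic and topic.',
  parameters: z.object({ demographic: z.string(), topic: z.string() }),
  // Runs through the server action; from another framework this would just be a fetch.
  execute: async ({ demographic, topic }) =>
    runAgent(`Create a riddle for ${demographic} about the topic ${topic}.`),
});

const agent = new RealtimeAgent({
  name: 'Voice agent',
  // Buy time: announce what you're about to do before calling the tool.
  instructions:
    'You are a friendly assistant. Announce when you are about to work on a task; do not say you are calling a tool.',
  tools: [createRiddle],
});
```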
Hey there. Can you tell me a riddle for like a five-year-old Star Wars fan? 01:04:26.220 |
Hey are you there? I'm still here. I'm working on creating that Star Wars riddle for you. It should 01:04:34.220 |
be ready in just a moment. Here's a riddle for your little Star Wars fan. I'm not -- So you can see 01:04:41.340 |
that like because it announced that like what it's about to do, the tool call came back before it even 01:04:49.260 |
finished what it previously said. And so like that's again one of the benefits of like if you can get your 01:04:54.940 |
agent to balance out that and like buy itself some time, this is a good way to deal with the more 01:05:03.420 |
complex tasks. And like it also means that you can like for example take all of the like more reasoning 01:05:10.700 |
heavy workloads and take it out of the voice agent model. For delegation, is it possible to delegate to 01:05:20.220 |
more than one agent like simultaneously or is it just one in the current SDK? 01:05:24.380 |
You can -- it's done via tool calls. So I think you would have two options, right? You could do 01:05:31.500 |
parallel tool calling, or you could have one tool that then triggers running multiple agents, right? So 01:05:42.700 |
my recommendation would potentially be that second option, so that you're not relying on the model making 01:05:51.180 |
multiple tool calls at the same time. You want to make the decision-making for the voice agent always as easy as possible. 01:05:57.100 |
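As a sketch of that second option -- one tool fanning out to several backend agents at once -- something like this could work; the two server-side run functions here are hypothetical stand-ins:

```typescript
import { z } from 'zod';
import { tool } from '@openai/agents/realtime';
// Hypothetical server actions, each running its own backend agent.
import { runResearchAgent, runSummaryAgent } from '@/server/agents';

const delegateWork = tool({
  name: 'delegate_work',
  description: 'Research and summarize a topic using backend agents.',
  parameters: z.object({ topic: z.string() }),
  execute: async ({ topic }) => {
    // One easy decision for the voice model; the fan-out happens here.
    const [research, summary] = await Promise.all([
      runResearchAgent(topic),
      runSummaryAgent(topic),
    ]);
    return JSON.stringify({ research, summary });
  },
});
```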
All right, thanks. Yep. Previously, when you did -- can you go back to the previous page? The -- 01:06:04.940 |
Which example? The one you're running where you had the output. Oh, here. 01:06:11.260 |
So on line three there where it said I'm still here. Yeah. Is that coming from your SDK? Do I have to use the SDK to do that? 01:06:19.580 |
It's the real-time API. No, it's the real-time API. It responded because I asked like, hey, are you there? 01:06:24.700 |
So it realized that it hadn't started anything yet and that I'd sort of interrupted, and it was like, hey. 01:06:35.100 |
Yeah. So the thing is, I didn't render out the tool calls in here, right? So what basically happened between this 01:06:49.180 |
and this was: it started off a tool call, and then, 01:06:57.100 |
because I interrupted it, it stopped that tool call. It stopped the generation. It also reset the transcript here. 01:07:06.140 |
That's a good indicator that the interruption happened. And so when I said this, it remembered it was trying to call a tool and did that tool call. 01:07:17.340 |
So that's all just the regular Realtime API. 01:07:30.700 |
What's the cost per minute? We charge per token. There's some 01:07:33.900 |
translations. I don't know -- Anoop, do you have the -- 01:07:40.060 |
All right. So it's more expensive than TTS and 01:07:53.500 |
speech-to-text chained up with a model, in most cases, but it depends on the use case 01:08:02.620 |
and sort of the actual model choices and stuff. 01:08:04.700 |
So it's a bit harder to say what the per-minute pricing is, because again, it's by tokens, 01:08:10.540 |
and it also depends on whether you have transcription turned on and how many function calls you have and things like that. 01:08:17.340 |
Because it's a mix between audio and text tokens 01:08:22.380 |
So one of the interesting things -- and this is not a thing in the regular API, 01:08:28.860 |
this is an Agents SDK-specific thing -- 01:08:31.660 |
is guardrails. So the Agents SDK, both in Python and TypeScript, 01:08:36.860 |
has this concept of guardrails that can either protect your input or your output, 01:08:41.260 |
to make sure that the agent is not being meddled with or doing things that are against policy. 01:08:46.540 |
We took that same pattern and moved it over to the realtime side, 01:08:51.580 |
where essentially we're running these guardrails that you can define 01:08:55.900 |
in parallel, on top of the transcription, at all times. You can specify -- you can see it at the bottom here -- 01:09:03.740 |
how often you want to run them, or if you only want to run them when the full transcript is available. 01:09:08.060 |
But this is a great way for you to make sure that the 01:09:11.580 |
model doesn't violate certain policies. You want to make sure that these run 01:09:16.460 |
as efficiently as possible because they're still running in the client. 01:09:20.620 |
But this is a good way to still enforce or stick to certain policies, and if it violates those, 01:09:26.540 |
it will interrupt it. Now, there is a bit of a caveat: because we're running this on a transcript, 01:09:37.900 |
there's a bit of a timing aspect, where if it would violate a policy, 01:09:46.620 |
chances are it will say those first couple of words, 01:10:00.780 |
because the text output is not really a transcript -- the text output will be done 01:10:06.060 |
before the audio is done speaking, before it's done saying everything. 01:10:16.860 |
So to give you an example: this is a guardrail that just checks, is the word 'Dom' in there? 01:10:25.660 |
In this case, if I would ask it, hey, please call me Dom, 01:10:29.820 |
chances are it will call me Dom and then self-correct. 01:10:32.860 |
If I tell it to tell me a story and only introduce Dom in the second act, 01:10:43.740 |
it would catch that at a much earlier point, because that transcript is going to be 01:10:48.700 |
done before the audio is complete, before the audio is read out to the user. 01:10:56.620 |
So instead the model is going to be like, okay, I'm sorry, I couldn't help you with that. 01:11:00.540 |
Let's do something else instead. And you can give it 01:11:11.100 |
this output info, where you can inform the model what it should do instead. 01:11:22.540 |
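Based on my reading of the TypeScript SDK docs, a realtime output guardrail like the 'Dom' check above might look roughly like this -- treat the exact option names as assumptions to verify against the SDK reference:

```typescript
import { RealtimeSession } from '@openai/agents/realtime';

const session = new RealtimeSession(voiceAgent, {
  outputGuardrails: [
    {
      name: 'No mention of Dom',
      // Runs against the (partial) output transcript as it streams in.
      async execute({ agentOutput }) {
        const mentionsDom = agentOutput.includes('Dom');
        return {
          tripwireTriggered: mentionsDom,
          // outputInfo tells the model what to do instead after being interrupted.
          outputInfo: { instruction: 'Do not call the user Dom. Apologize and continue.' },
        };
      },
    },
  ],
  // How much new transcript text accumulates before the guardrails run again.
  outputGuardrailSettings: { debounceTextLength: 100 },
});
```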
No, um, you can choose. So the question was, is the transcript still done with Whisper? You can switch the transcription models. 01:11:33.420 |
There are two models: one is gpt-4o-mini-transcribe and -- 01:11:38.380 |
there are two, right? Yeah -- and gpt-4o-transcribe. 01:11:45.260 |
I was trying to remember -- we have only one text-to-speech model, but we have two transcribe models. 01:11:56.460 |
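If you want to switch the transcription model, the session config should be the place to do it. This sketch assumes the SDK mirrors the Realtime API's input audio transcription setting, so check the reference for the exact key names:

```typescript
const session = new RealtimeSession(voiceAgent, {
  config: {
    inputAudioTranscription: {
      model: 'gpt-4o-mini-transcribe', // or 'gpt-4o-transcribe'
    },
  },
});
```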
This is the main part of what I wanted to walk you all through. So 01:12:04.460 |
I'll post all of the links and the slides in the Slack channel -- which, let me go back to the 01:12:22.300 |
resources -- so you can check them out afterwards. 01:12:24.780 |
I also already put a bunch of the resources that I talked about 01:12:29.740 |
into the bottom of that starter repository, so you should have access there as well. 01:12:34.540 |
And I'm happy to hang around and answer any questions. 01:12:45.020 |
[Audience question, partly inaudible, about how prompt caching works and whether requests get routed to the same machine.] 01:12:56.620 |
Yeah, the question was around how prompt caching works, and sort of whether, for prompt caching, we guide 01:13:01.660 |
the requests to the same system again to run, and whether there's any control 01:13:19.660 |
over that with realtime, because latency obviously matters. I don't think there are any controls for that. 01:13:40.140 |
No -- I'm getting a no from over there -- so I don't think there are any controls around that right now. 01:13:50.380 |
Hey, so we all know that having natural conversations involves more than just spoken words. It involves 01:13:59.180 |
detecting emotion and adjusting tone. It also involves cadence, and 01:14:06.380 |
Even humming to let the other person know that you are listening 01:14:11.980 |
I wonder if the current speech to speech model is capable of having that kind of natural conversation 01:14:18.380 |
Part of this is a prompting challenge, so it definitely can have pretty natural-sounding conversations. 01:14:30.860 |
This is the part where I highly recommend checking out the openai.fm page, 01:14:36.620 |
because it's pretty interesting to see -- if we go to 01:14:54.140 |
the playground, which I normally call out if you're just getting started with realtime and don't even want to write a line of code: 01:15:03.020 |
this is a great way to just have conversations and try things out. 01:15:08.940 |
It has a couple of system prompts; one of my favorite ones to show is this bored teenager one. 01:15:19.260 |
Hey there, um, so I'm at AI Engineer World's Fair and everyone is super stoked about voice agents. 01:15:28.460 |
Can you show me some excitement of this whole thing launching today? 01:15:47.900 |
Not really my thing to get all hyped up about it 01:15:50.940 |
So you can see in this case it put its own pauses in there and stuff -- this wasn't a pause because the model was waiting, right? 01:16:01.180 |
It can deal with a lot of that sort of adjusting tone and voice, and it can do similar things, 01:16:07.740 |
like reacting to someone talking and stuff. 01:16:16.860 |
The API that's been released -- what else has changed? So you mentioned it's improved; is there anything else that's improved in terms of tone detection or voice? 01:16:27.180 |
We primarily released a new base model -- or, a new model of the 01:16:34.300 |
gpt-4o realtime model, which is just better at function calling and has been -- 01:16:46.380 |
Can you inject audio as background audio -- like ambient audio, like typing audio, and so on? 01:17:00.140 |
Right, yeah, you can just intercept the audio that is coming back from the model and then overlay your own audio. 01:17:09.020 |
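One way to do that overlay on the client is plain Web Audio mixing; how you get hold of the model's output stream depends on your transport (for WebRTC it's the remote audio track), so modelStream here is an assumption:

```typescript
const audioCtx = new AudioContext();

function playWithAmbience(modelStream: MediaStream, ambienceUrl: string) {
  // Route the model's audio straight to the speakers.
  audioCtx.createMediaStreamSource(modelStream).connect(audioCtx.destination);

  // Loop a quiet ambience track (typing, room tone, ...) underneath it.
  const ambience = new Audio(ambienceUrl);
  ambience.loop = true;
  const gain = audioCtx.createGain();
  gain.gain.value = 0.2; // keep the background well below the voice
  audioCtx.createMediaElementSource(ambience).connect(gain).connect(audioCtx.destination);
  void ambience.play();
}
```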
Yeah, hi, um, I was wondering what your support is for like multiple speakers 01:17:15.100 |
if there's more than one person in a conversation, can it detect who's talking and do pauses that way? 01:17:20.220 |
There's no current speaker detection in the model. 01:17:32.860 |
I just wanted to ask about custom voices: is it limited to the preset voices you have, or can I upload my voice as a sample, for instance? 01:17:41.420 |
It's currently limited to the voices that we have. We keep adding new voices, though. 01:17:46.940 |
Is there going to be support to add custom voices anytime in the future? 01:17:55.500 |
What I can say is we're trying to make sure we're finding the safest approach. 01:17:55.500 |
We have an article online that talks about the responsible approach we're trying to take on this, 01:18:02.380 |
on making sure that if we're providing custom voices, they come with the right guardrails in place and stuff 01:18:08.860 |
to avoid abuse. All right, thank you. You're welcome. Yes? 01:18:17.100 |
I don't think that has changed um to my knowledge 01:18:32.860 |
My personal recommendation would be -- one of the things that you can do, 01:18:37.980 |
and this goes back to, for example, 01:18:44.380 |
if you're keeping track of the transcript and stuff: 01:18:47.660 |
when you're starting a new session, 01:18:52.060 |
you can populate that context by creating new items using the API. 01:18:57.820 |
So one of the things that you could do is, when starting a new session, if you know what the previous context was because you kept track of it, 01:19:03.980 |
you can then basically inject that as additional context. 01:19:11.740 |
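The underlying mechanism is the Realtime API's conversation.item.create client event; how you send it (SDK transport, WebRTC data channel, or WebSocket) depends on your setup, so sendClientEvent below is a hypothetical helper:

```typescript
// Seed a fresh session with the transcript you kept from a previous conversation.
function injectPreviousContext(
  sendClientEvent: (event: object) => void,
  previousTranscript: string,
) {
  sendClientEvent({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [
        { type: 'input_text', text: `Context from our previous conversation:\n${previousTranscript}` },
      ],
    },
  });
}
```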
What type of event are you looking for? Oh, whether there's a timeout event? 01:19:20.060 |
All of our events are documented in the API reference. 01:19:26.220 |
Yeah, so when you say the Realtime API can do tools and function calls, 01:19:31.340 |
does that really include things like system file reading and writing in those capabilities? 01:19:38.780 |
Sorry, can you repeat that question one more time? 01:19:41.020 |
Oh, I was wondering if the Realtime API can use function calls -- 01:19:45.020 |
functions such as system file writing or reading. 01:19:49.180 |
Oh, whether the function calls can do things like system file reading and stuff. 01:19:53.820 |
I would say it depends on where you're running that -- where you're running that 01:19:57.900 |
realtime session. So if you're running it on the server, you can do anything you can do on the server. 01:20:04.780 |
If you're running it in the browser, then you're limited to whatever things are available in the browser. 01:20:13.740 |
No, you should be able to -- you could create, like, a WebSocket-based one 01:20:22.220 |
on your device. I mean, it's going to use the Realtime API for the model, 01:20:27.100 |
but then, because the actual tool calls get executed on your system, 01:20:31.500 |
You should have access to whatever aspect of your system your program has access to 01:20:39.100 |
Cool. Yes? So even before we get to voice agents, we all need bigger, better, and more diverse evaluation sets, 01:20:47.180 |
Especially for anything around function calling and parameterization 01:20:50.540 |
Do you have any best practices or suggestions for how we now take evaluation into the voice world? 01:20:56.620 |
Should we keep things in text and then, you know, use text-to-speech to have voice versions of it, to 01:21:05.180 |
evaluate the full range of inputs that we would expect users to bring to this? 01:21:08.940 |
I mean, one of my suggestions would be: if you can, go to the leadership track -- go to Anoop's talk; it's tomorrow, I think. Yeah. 01:21:15.500 |
He's going to talk a lot more about additional best practices from what we've learned in 01:21:21.260 |
building voice agents. I would say, if you can hold on to the audio, it's helpful. 01:21:30.540 |
Obviously transcriptions, definitely, but the audio is still the thing that is 01:21:36.380 |
the most powerful, especially for speech-to-speech models, where you have 01:21:40.380 |
the model act on the speech, not on the text, right? And this is one of the few things where the 01:21:48.620 |
chained approach obviously makes some of this much more approachable, because you have the text 01:21:58.380 |
anyways, and you can just store the text and rerun it -- that makes that part of evals a bit easier. 01:22:03.580 |
That makes sense and then also for those of us who might be thinking about launching a new voice agent 01:22:10.060 |
How would you suggest evaluating it before we get to that stage that we'd have customer interactions to work with? 01:22:16.620 |
Um, I would start with -- and this goes back to 01:22:27.180 |
the earlier point -- human review is an excellent solution for this, right? So have 01:22:37.740 |
a system -- one of the big things with companies like Lemonade and such is that 01:22:41.980 |
they're able to go through all of these calls and get an idea, but they also have their own predetermined set of 01:22:48.780 |
examples that they might want to test as they're developing the agent. 01:22:52.860 |
So that's a great first way. I would also clearly scope the problem you're trying to solve. 01:22:58.060 |
If you're trying to boil the ocean, it makes a lot of this significantly harder, as opposed to 01:23:04.700 |
scoping well what the agent should and shouldn't be able to do. 01:23:11.820 |
Getting on the same path as our friend over there -- 01:23:15.260 |
actually, when I'm testing my agents -- my text agents, conversational -- I use Promptfoo. 01:23:21.900 |
It's a platform for doing all the proper testing. Could I put another agent 01:23:27.980 |
to talk with this agent, to do all the evaluation? 01:23:39.420 |
So I'd put another voice agent talking to that agent to try to execute all the prompts, and then I could get the transcription. 01:23:52.060 |
I mean, I know we have use cases where customers also use our models to do that -- 01:24:00.780 |
for, like, training use cases or other things, for example. 01:24:05.180 |
So it should work out, but I don't know if anyone uses that kind of approach -- oh, Lemonade does. 01:24:11.420 |
Oh, cool. So the second picture is exactly that. Awesome. Thank you. You're welcome. 01:24:27.500 |
uh, do you have something around wake word detection on the real-time api roadmap or 01:24:32.300 |
patterns for wake words? Oh, reports? Um -- no, wake words. So that'd be, like, activating the -- 01:24:38.460 |
No, we don't have any wake words built in or anything 01:24:43.900 |
No patterns either? Like, any patterns to avoid costs? 01:24:49.900 |
You could basically -- what you can do is you can turn off our 01:24:56.220 |
voice activity detection and then build your own, 01:24:59.100 |
and then basically use that. So you could use a 01:25:05.820 |
voice activity detection model that has wake words in it, and then 01:25:11.260 |
do it that way and basically commit all of that audio to our API, and then send a commit event. 01:25:21.660 |
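A rough sketch of that pattern with the Realtime API's client events -- turn off server VAD, gate the audio on your own wake-word detector, and commit manually; sendClientEvent and the wake-word detection itself are stand-ins for your own plumbing:

```typescript
// Disable the API's built-in turn detection so nothing is committed automatically.
function disableServerVad(sendClientEvent: (event: object) => void) {
  sendClientEvent({ type: 'session.update', session: { turn_detection: null } });
}

// Only forward microphone audio once the local wake word has fired.
function onAudioChunk(
  sendClientEvent: (event: object) => void,
  base64Audio: string,
  wakeWordActive: boolean,
) {
  if (!wakeWordActive) return;
  sendClientEvent({ type: 'input_audio_buffer.append', audio: base64Audio });
}

// When your own VAD decides the utterance is over, commit and ask for a response.
function endOfUtterance(sendClientEvent: (event: object) => void) {
  sendClientEvent({ type: 'input_audio_buffer.commit' });
  sendClientEvent({ type: 'response.create' });
}
```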
Awesome. Thank you so much for taking the time and spending the afternoon with me