All right. Hello, everyone. I hope you have enough space. Thanks so much for joining. We're building multilingual conversational AI agents. I know it's a bit of a mouthful, but hopefully we'll get something going. So, different from what's on the poster: we're not from Evelyn Labs, we're from Eleven Labs.
Can I get a quick show of hands? Have you heard of Eleven Labs? Okay. Everyone has heard of it. That's great. So we can do some nice things today. Maybe look at some new stuff. Maybe you haven't played with yet. So this is me. I'm Thor here on the right side of the screen.
I work on developer experience at Eleven Labs. So a lot of it is kind of the conversational AI agents platform. And I also have my colleague Paul here with me. So if you have kind of any questions, you know, throughout the workshop later on, we'll be floating around and, you know, happy to kind of, you know, answer your questions.
So feel free to just put your hand up and we'll be floating around later. Yeah. Paul works with me on developer experience. So if you have any feedback as well, you know, on documentation, examples, developer experience, or generally just the product, do let us know. Yeah. Also this, you can scan the QR code to get the slides.
So there's a couple of resources that are linked in the slides, as well as there's a form where you can fill in your email address if you want to get some credits. So we can give you, you know, a couple of credits for the next three months to sort of, you know, play around with Eleven Labs and kind of try this out.
So, yeah, feel free to just scan this and then, you know, kind of save this for later on. There's the resources linked in there as well. Cool. Yeah. Also, if you're building with Eleven Labs, I recommend you follow our Eleven Labs Devs Twitter account. This is specifically for, you know, updates in terms of API versions, client libraries.
So if you're a developer building with Eleven Labs, that's a good place to follow and stay in the loop with what is happening. And then, as I mentioned, if you scanned the QR code earlier, there's also a link, so you can tap on the QR code to open the form.
And if you just fill in your email address, we will send you the credits after this workshop. In the workshop itself you can get started with the free account, which should be plenty. But then we'll send you a coupon code via email for the next three months so you can try it out.
Cool. And as I mentioned, yeah, the resources in the slide. So we'll, you know, kind of tap into these later on and then can kind of, you know, get started building. We can see as well kind of what folks are building, you know, since we're looking at kind of multilingual conversational AI agents.
Do you just want to shout out the languages that you're specifically looking to unlock with conversational AI? Anyone, anything other than English? Any other languages? Portuguese. Portuguese, very nice. Are you looking for a Brazilian Portuguese accent, or...? Cool. Yeah, we can look at that.
So we've got Portuguese. Any other languages? Spanish. Yes, we got some Spanish in there as well. That's good. So Portuguese, Spanish. Hungarian. Very good. Actually, do we have Hungarian right now? We don't yet, but hopefully soon: we're working on version three of the multilingual models.
We'll need to double check, and I don't want to promise something wrong, but maybe not today, maybe in a couple of weeks, we can give you Hungarian. That's great. OK. Any others? Mandarin? Hindi. Yeah, a huge population speaks Hindi. Then again, India has 50-plus languages, I believe. We currently have Hindi and Tamil, and we're working on adding additional languages there. OK, so we have a good mix that we can play around with. Cool. Yeah, if there are any other languages later, we can explore those as well.
Maybe one language that probably doesn't get spoken about often enough is barking. If you want to build applications for the 900 million dogs that are out there (I actually looked this up), you can use our model. This was a launch from about two months ago now, I believe.
So maybe we can have a little listen. The Dachshund is my favorite, actually. The Golden Retriever... a little bit more, a little bit more. Yeah. So we think we have a great use case for you. Did anyone see this launch, like two months ago?
No? Yeah, you saw it? Yeah. The unfortunate thing is we launched it on the 1st of April, actually, so everyone thought it was an April Fools' joke; the timing was a bit unfortunate on that one. But as you can see, text to bark, it's very real. Well, in fact, it was an April Fools.
So be careful when you play this to your dog; they might get offended, because we can't guarantee that the context translates. But the actual sounds that you're hearing are generated by our sound effects model. So we do have a sound effects model: if you go to the app, there is this sound effects model here.
And this sound effect was also created with it. So, truck reversing, maybe that's a good one. You can click generate, and basically we generate four different samples for you that you can use. So, not so interesting for your conversations per se.
But if you're creating video games, for example... I don't know if anyone is doing that. We might be having some internet issues. I hope we have internet. No? Okay, the internet is looking okay. So, yeah, not sure what's happening here. But, for example, this was a drum cowbell I generated recently.
Actually, if you Google "ElevenLabs soundboard": we built this recently, which is pretty cool. It's basically like a drum machine, so you can loop sound effects, and the drums are pretty nice as well. And they're mapped to the keys on the keyboard, so you can play them live. So that's pretty cool. And then you can add new sound effects and so on. You know, just to give you an idea of some of the things that we do.
Now, obviously here, you know, we're talking -- we're talking about conversational AI agents. And so, specifically, if we kind of look at the different components that are involved in building conversational AI agents, we have the user who is speaking, you know, some language. And so we need to transcribe that speech into text.
We then feed that into a large language model, which acts as the brain of our agent. In our case, we currently don't build any intelligence models ourselves; we partner with the existing large language model providers, so your GPT-4o, your Google Gemini, what have you. And then this large language model, which acts as the brain of your agent, will generate a text output, and we basically stream that text output back into speech. So that is roughly the pipeline that we have for building conversational AI agents. Now, with that, there is a bunch of system tools that are built in.
So for example, we have this language detection system tool, which facilitates the language switching that we're going to look at in a bit. We also have, you know, function calling, tool calling. So, you know, if you need to give access to kind of specific functionality to your agent, you can, you can do that here as well.
And then there are different approaches. If you've seen OpenAI Realtime, for example, that doesn't actually go through text, right? It goes sound token to sound token, which has some benefits. But from what we've seen, for deploying conversational AI agents at scale and really understanding what is happening: if you're going sound token to sound token, you're flying blind a little bit. You're trusting the model that it actually replies intelligently. Whereas if you're going through text, you can have better monitoring and understand what's going on in your conversation.
So while we're also exploring sound-to-sound for conversational AI agents, for now what we've found works best is this pipeline that we've built. And we deploy all of these models very close to each other to bring down the latency as much as possible.
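To make that pipeline concrete, here is a rough, non-streaming sketch of the three stages stitched together by hand: speech to text, an LLM as the brain, and text to speech. This is purely illustrative; the hosted Conversational AI product wires this up for you with streaming and much lower latency, and the SDK method names below follow the ElevenLabs and OpenAI Python SDKs as I understand them, so double-check them against the current docs.

```python
# Conceptual, non-streaming sketch of the STT -> LLM -> TTS pipeline that the
# hosted Conversational AI product runs for you (with streaming and far lower
# latency). SDK method names are as I recall them and may have changed;
# treat this as pseudocode rather than a reference implementation.
from elevenlabs.client import ElevenLabs
from openai import OpenAI

eleven = ElevenLabs(api_key="ELEVENLABS_API_KEY")
llm = OpenAI(api_key="OPENAI_API_KEY")

def reply_to_user(audio_file_path: str, voice_id: str) -> bytes:
    # 1) Speech to text: transcribe the user's audio.
    with open(audio_file_path, "rb") as f:
        transcript = eleven.speech_to_text.convert(file=f, model_id="scribe_v1")

    # 2) LLM: the "brain" of the agent generates a text reply.
    completion = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful multilingual agent."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3) Text to speech: turn the reply back into audio with the chosen voice.
    audio_chunks = eleven.text_to_speech.convert(
        voice_id=voice_id,
        text=reply_text,
        model_id="eleven_flash_v2_5",
    )
    return b"".join(audio_chunks)
```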
Cool. And so now we can look at the individual components within that. For example, speech to text: this is actually the most recent model we launched, and this one wasn't an April Fools. So this is our speech to text model.
So our automatic speech recognition model, so ASR. And this works, you know, kind of benchmark leading across 99 different languages at the moment. So what this does, you know, you can see sort of the functionality that's built in there. There's, you know, word level timestamps, there's speaker diarization, there's audio event tagging.
So if you want coughing, laughing, some of these audio event tags in your transcript, you can enable that as well. And it's all within structured API responses, which is really nice. So what you can see here, for example: we have a conference call, and as we're playing this audio, the model recognizes the different speakers and tags them as speaker one, speaker two, and then we have the word-level timestamps. So you see, as I play this (the clip says "Jonas, four-week timeline"),
you can highlight the individual words at the word level. So that's really, really useful. And obviously, this is all available through the API.
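For reference, a minimal sketch of calling the speech-to-text API with diarization and audio-event tagging enabled. Parameter and field names follow the Python SDK as I recall them and may differ slightly, so treat it as a starting point rather than the exact interface.

```python
# Minimal speech-to-text sketch with diarization and audio-event tagging.
# Parameter names are as I recall them from the ElevenLabs Python SDK and
# may differ slightly; check the current docs.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="ELEVENLABS_API_KEY")

with open("conference_call.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        file=audio,
        model_id="scribe_v1",
        diarize=True,            # label speaker 1, speaker 2, ...
        tag_audio_events=True,   # e.g. (laughing), (coughing)
    )

print(result.language_code)      # detected language
for word in result.words:        # word-level timestamps
    print(word.start, word.end, word.text)
```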
Actually, one thing I did, and you can try this out yourself, it's available for free to demo: if you're using Telegram, I've built this little Telegram bot where you can forward voice messages or videos, and it automatically identifies what language it is and gives you back the transcript of that message. You know this all too well: you're sitting in a meeting, and your grandmother sends you a voice message, and you don't know, is this urgent?
Is this important? So what you can do is forward it to the bot, and very quickly you will get the transcript back. So if you want to try this out, you can. You can see it here; I hope you recognize this, if you were paying attention.
Anyone recognize this? Yes, someone was paying attention, that's great. So I was just recording a voice message on my phone, and then I get the transcript back. Now, the cool thing as well: I actually live in Singapore.
So if you spend time in Singapore, you might have heard kind of the Singlish, right, which is sort of the Singapore English. Go to hell, this person. Bastard, you know. You ask for a plastic bag, you must put the thing inside the plastic bag for me, right? She never, she just put the plastic bag, throw the plastic bag on the table.
So, anyone understand what's going on? Have we got any Singaporeans in the house? No? It's not that easy. Even after six years in Singapore, I still sometimes struggle with that. So what we can do is forward it to our transcription bot. We can see, okay, it was received, it's transcribing it now. And yeah: go to hell, bastard, you know, I asked for a plastic bag, you must put the thing inside the plastic bag for me. So you can see, these are the kinds of problems we have in Singapore. Now, another accent that might be a bit challenging:
Scottish. Hello there. There's some things I need to know about Scotland. Eh, well, you need to know that they're trying to split up the country, you know that. And we don't want that. We want a decent council and all, and that's basically that. That is pretty...
Anyone understand what's going on in Scotland? No? Okay, no Scottish people here. Again, we can forward it and we can see. And this is really cool, actually: even without fine-tuning the model, it can understand specific accents quite well.
So: you need to know they're trying to split up the country and all that. And we don't want that. We want a decent council and all. And that's basically that. There you go. I've now listened to this so often that I can actually understand it. But yeah, you might not be familiar with it.
Now, one example that's maybe a bit, um, closer to where you're living, if you're here in the US. And Mr. President, what we would like to say about the Bangladesh issue, because we saw and it is evident that how the deep state of United States was involved to regime change during the Biden administration.
And then Muhammad Yunus made, uh, Junior Soros also. So what is your point of view about the Bangladesh issue? And what is the role that the deep state played in this situation? So what you're seeing here is a president who has a translator for English-to-English translation.
So we thought, you know, maybe he can just use this transcription bot. So if we forward that... again, even if the audio quality isn't that great, even if there's a lot of background noise, the model is actually very good at handling that. Yeah, so this is one component of it.
So we need to understand what the user is saying and then feed that into our LLM to have a meaningful conversation. For the other part, we don't specifically provide the intelligence layer ourselves; that's where we partner with the leading model providers.
You can also fine-tune your own model. If you, for example, fine-tune and deploy it on, say, Google Vertex AI, you just need an OpenAI-compatible API endpoint, and then you can plug your custom LLM into this pipeline as well, which we can look at in a little bit.
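As an illustration of what "OpenAI-compatible" means in practice, here is a hypothetical, minimal server exposing a /v1/chat/completions route in the OpenAI response shape; the agent platform can then be pointed at its URL as a custom LLM. A real deployment would add authentication, streaming, and error handling, and my_custom_model is just a placeholder for your own fine-tuned model.

```python
# Hypothetical minimal "OpenAI-compatible" endpoint you could register as a
# custom LLM. This only shows the request/response shape; real deployments
# would add auth, streaming (SSE) and error handling.
import time
from fastapi import FastAPI, Request

app = FastAPI()

def my_custom_model(prompt: str) -> str:
    # Placeholder: call your own fine-tuned model here (e.g. one on Vertex AI).
    return f"Echo: {prompt}"

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    user_message = body["messages"][-1]["content"]
    reply = my_custom_model(user_message)

    return {
        "id": "chatcmpl-demo",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "my-custom-model"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```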
And so now we have the other component: once the LLM starts streaming out the response, we want to start streaming the speech as soon as possible. This pipeline is streaming throughout, so we get the fastest, snappiest response possible.
So for the actual text to speech, what's great with ElevenLabs: we heard we want a Brazilian Portuguese accent, right? We have a huge library of voices available on the platform, actually more than 5,000 different voices that you can choose from.
And so if you go into your ElevenLabs account, you can go to voices and explore different voices. Now, if you were sitting here in this workshop and you thought, oh, I really like this voice, you're in luck. You can go to the voice library and type in "German engineer".
I'm originally from Germany, and you'll find me. True success is doing what you are born to do and doing it well. Does that sound like me? Okay, it's trained on some of my YouTube videos, so I think maybe I talk a bit differently in the YouTube videos. But yeah, this is great.
And the thing is, I basically cloned my voice and published it on the voice library, and anytime you use that voice, I get royalties. So this is a marketplace. We actually just recently surpassed the 5 million US dollar milestone in payouts to the voice actors who publish their voices on the platform.
And so, you know, in this workshop, if you use this voice, uh, I'll be very grateful because then, uh, I can have a coffee later. That's great. Um, no, but obviously, you know, you can use my voice if you want to, you don't have to. Uh, the great thing is you can set really kind of narrow filters to find sort of the voice that, you know, you want.
So, for example, if you want Portuguese, you can set the language filter to Portuguese and then choose the accent. Here, for example, we want Brazilian Portuguese. We can then further narrow this down in terms of gender and age; there are certain meta tags that we can apply. And then we can listen, maybe here: Não há fracasso, exceto em não tentar mais ("There is no failure, except in no longer trying"). Okay, I don't know, I haven't spent much time in Brazil, so... Duas amigas partem em busca de sossego em uma fazenda orgânica, longe do barulho da cidade ("Two friends set off in search of peace and quiet on an organic farm, far from the noise of the city").
Does that sound Brazilian? No? Yeah? Okay, a little bit. Um, so, I mean, there's a lot of voices available on the platform, so you can kind of see, um, if you find one that, that sort of fits, you know, the local accent that you're, that you're looking for. Uh, and then what you can do is you can, um, go and kind of put, you know, all these different pieces together into your conversational, um, AI agent.
So here, in the dashboard, we can go to conversational AI, um, and we can actually configure a lot of our agent, you know, right there within the dashboard. And then we can bring that into our application with, um, the JavaScript SDKs, the Python SDKs, kind of depending on, um, what applications you're building.
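On the Python side, here is a rough sketch, loosely based on the SDK quickstart, of running a dashboard-configured agent from a terminal with your microphone and speakers. Module paths and parameter names are from memory and may have changed, so check the current SDK docs.

```python
# Rough sketch, loosely based on the ElevenLabs Python SDK quickstart, of
# running a dashboard-configured agent from a terminal. Module paths and
# parameter names are from memory and may have changed; check the docs.
import os
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

conversation = Conversation(
    client,
    os.environ["AGENT_ID"],                   # the agent configured in the dashboard
    requires_auth=False,                      # True for private agents
    audio_interface=DefaultAudioInterface(),  # microphone in, speakers out
    callback_agent_response=lambda text: print(f"Agent: {text}"),
    callback_user_transcript=lambda text: print(f"You:   {text}"),
)

conversation.start_session()                  # runs until the conversation ends
```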
So we can just do a quick demo of an agent that I built for a conference in Singapore. If you're familiar with Singapore, there are four official languages, so if you're building applications for Singapore, you actually need to provide English, Mandarin Chinese, Malay, and Tamil.
These are the official languages in Singapore. Now, obviously there are other languages being spoken as well, Hindi and Japanese, for example. Within conversational AI, we currently have, I think, 31 different languages that the agent can speak, switch between, and identify.
We're working on the next iteration, the V3 of our multilingual models, which will bring, I believe, up to 99 different languages. So no matter where you are in the world, you get pretty good coverage. You can see here that we can configure the LLM: we can choose any of the big LLM providers out there that we have directly integrated and co-located with our platform, or you can put in your custom LLM, which just needs an OpenAI-compatible API endpoint. You can add a knowledge base: you can upload documents, you can reference your website.
There's RAG built in as well; depending on the model you're using, you might not want to put the entire knowledge base into the context, so you can enable RAG there. And then the tools: this is your standard function calling that you might be familiar with when working with large language models.
Depending on the model you choose, there might be some nuances in how reliably the function calling works, but the larger ones, GPT-4 and Gemini 2.5, are generally very reliable with function calling. And then we have the system tools that are built in.
So you can see here the language detection system tool. We can just enable this, and we have it configured by default to identify different languages and switch between them. But you can also override the LLM prompt to customize that to your use case.
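Pulling those settings together, here is an illustrative agent configuration showing a default language, an LLM choice, the language detection system tool, and per-language presets. The field names only approximate the real agent config schema, so verify them against the API reference before using them.

```python
# Illustrative agent configuration covering the pieces discussed above:
# additional languages plus the built-in language-detection system tool.
# Field names approximate the conversational AI agent config and are NOT
# guaranteed to match the exact API schema; check the docs.
agent_config = {
    "name": "API Day Singapore agent",
    "conversation_config": {
        "agent": {
            "language": "en",                      # default language
            "prompt": {
                "prompt": "You are the API Day Singapore assistant.",
                "llm": "gemini-2.5-flash",         # any integrated LLM, or a custom endpoint
                "tools": [
                    {"type": "system", "name": "language_detection"},
                ],
            },
        },
        "language_presets": {                      # per-language overrides
            "zh": {},  # Mandarin
            "ms": {},  # Malay
            "ta": {},  # Tamil
        },
    },
}
```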
And then maybe we can just give that a little test and see how that works. The agent: Hi there, this is the API Day Singapore conversational AI agent, how can I help you today? I reply in my broken Mandarin, basically saying: oh, I'm sorry, I don't speak English, can we speak Mandarin? The agent switches to Mandarin, and after a bit of back and forth: one of the ElevenLabs speakers at API Day Singapore is Thorsten Schaeff, who is listed as a developer advocate.
Would you like to know more about his session or background? Yeah, but what languages can you speak? I can communicate in English, Chinese, Hindi, Malay, Japanese, and Tamil. If you prefer to continue in any of these languages, just let me know which one you'd like to use. Would you like to switch to another language now?
Yes. Can we switch to Hindi, please? Okay... unfortunately, I don't speak any Hindi. Can you repeat that in English? Of course. We can continue in English. Would you like more information about Thorsten Schaeff or ElevenLabs? So, I guess you get the gist. With the language detection tool, there are two different modes to this.
You can either, as in the first scenario with my broken Mandarin, just say: oh, sorry, I don't speak any English, can we speak Mandarin? And even with my terrible Mandarin, it would recognize: okay, he's trying to speak Mandarin, so let's reply in Mandarin. So it was doing that. Or you can specifically ask: okay, what languages can you speak? And then explicitly ask it to switch. So this is the built-in language detection system tool that we can use to facilitate these multilingual conversations, which is really nice. Cool. So that is roughly what I wanted to show you as a start.
And now we have the next 30 minutes for you to play around with this yourself. We're in the room, and if you have any questions, we can answer them. So, there are various different ways that you can configure your agents.
So, you know, by default, you can get started in the dashboard and you can configure kind of a lot of the behavior and functionality in there. And then, you know, if you go back to the resources, we have, you know, the documentation. We have different examples that you can use around conversational AI.
So, for example, there's, you know, we have examples for Next.js to build that into your Next.js applications. We have examples for Python. If you want to build it, you know, on, like, hardware devices somewhere, you might want to use Python on, like, your Raspberry Pi, for example. So, we have the examples there.
And you can then, once you've configured that, you can bring that into, you know, your application. Alternatively, you can also configure all nuances of your agents via the API. So, actually, if you're building a marketplace where you are configuring agents on behalf of someone else, you know, you would generally do that through the API.
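For example, a small sketch of creating an agent programmatically; the endpoint path and payload shape are as I recall them from the API docs and may differ, so verify them before relying on this.

```python
# Sketch of creating an agent programmatically, e.g. for a marketplace that
# provisions agents on behalf of users. Endpoint path and payload shape are
# as I recall them from the docs and may differ; verify before use.
import requests

payload = {
    "name": "Support agent (Portuguese)",
    "conversation_config": {
        "agent": {
            "language": "pt",
            "prompt": {"prompt": "You are a friendly support agent."},
        },
    },
}

resp = requests.post(
    "https://api.elevenlabs.io/v1/convai/agents/create",
    headers={"xi-api-key": "ELEVENLABS_API_KEY"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["agent_id"])   # store this to start conversations later
```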
And we also have an MCP server. So if you're using, say, Claude Desktop, you can bring in the MCP server, and you can just say in natural language: please set up an ElevenLabs conversational AI agent with this voice. And it will go and it knows which API calls to make to set up your agent.
But, yes. So, thanks so much for joining. If you have any questions, we'll be floating around; you can also ask them now, here in the audience. But otherwise, please just go to elevenlabs.io and create your account if you don't have one already.
And then you just go to the app, go to conversational AI, and in the agents interface you can create a new agent. Maybe we'll just start off with the support agent here, and then we can go through and configure this agent with your voices and your languages.
And then, yeah, we'd love to hear a lot of different agents speak, you know, at the end of the next 30 minutes. Awesome. Thanks so much. Do let us know your questions and we'll be here for the next 30 minutes to help you set up your agent yourself. Thank you.
Did anyone have questions they wanted to ask in the room? I think you're welcome to; there are microphones. Yeah, there. Do you want to just go up to the microphone and ask? Hello, everyone. First of all, great presentation. My question relates to what you said about it switching to different languages without fine-tuning.
What is the background process for that? Can you explain a bit more how it switches so smoothly that it can interact in the regional languages, with a matching accent as well? Yeah, so what you can see here in my agent configuration is that, within the voices tab, I can assign different voices to the different languages.
So, for example, in Singapore, the most commonly spoken Tamil accent in Singapore is the Chennai accent Tamil. And so basically, I went to the voice library and I found a voice that is a Chennai accent Tamil. And I then basically added that to my voice library and just assigned that here.
So in the voices tab, you can configure the different voices for the different languages. Which means the actual language detection part is handled by the automatic speech recognition model. The ASR model identifies which language is being spoken: it basically assigns a likelihood score that this is the language being spoken, alongside the transcript. And with the system tool, based on the confidence score for the detected language, we then automatically switch to that language in the background and use the voice you configured for that language to reply. Does that roughly answer the question?
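Purely as an illustration of that flow (not ElevenLabs' actual implementation), the routing boils down to something like this: the ASR returns a language code plus a confidence score, and above some threshold the agent switches language and picks the voice you assigned to it in the voices tab. Voice IDs and the threshold below are made-up placeholders.

```python
# Purely illustrative pseudocode of the routing just described; this is NOT
# ElevenLabs' actual implementation. The ASR returns a language plus a
# confidence score, and above a threshold the agent switches to the voice
# configured for that language.
LANGUAGE_VOICES = {          # what you configure in the Voices tab (example IDs)
    "en": "voice_en_default",
    "ta": "voice_ta_chennai_accent",
    "zh": "voice_zh_mandarin",
}
SWITCH_THRESHOLD = 0.8       # hypothetical confidence threshold

def pick_voice(asr_result, current_language):
    detected = asr_result["language_code"]
    confidence = asr_result["language_probability"]
    if detected != current_language and confidence >= SWITCH_THRESHOLD:
        current_language = detected          # switch languages mid-conversation
    return current_language, LANGUAGE_VOICES[current_language]

# Example: the ASR is fairly sure the user just spoke Tamil.
lang, voice = pick_voice({"language_code": "ta", "language_probability": 0.93}, "en")
print(lang, voice)   # -> ta voice_ta_chennai_accent
```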
Cool. Yeah, do you mind coming forward? Just because I think it's also being recorded, so then we have it. Great presentation, by the way. Quick question: other than the changing of the language, the language detection and so on, does it have any other ability to take actions throughout the call?
For example, the appointment-setting use case: does it have actions to maybe call a webhook and check whether there are any appointments available, either in Make or n8n, and then report back? Can we have that in the prompts? Yeah, correct. Basically, the configuration of that is a combination of your system prompt together with the tools.
You can configure custom tools, and these can be server-side tools, which would be a webhook call to your CRM, to your system. This is the standard tool calling, function calling that the LLM supports. So, like GPT-4o or the more modern models generally support function calling and structured outputs.
And so, as long as the large language model that you're using to power your agent supports function calling, you can add your tools, and this can be a combination of server-side tools. For example, as you mentioned, scheduling: we also have an example with cal.com.
You can put in the API endpoints for cal.com, and then the agent can actually look up whether there is availability in the calendar, ask for the email address, and schedule the meeting for all the parties through the conversational AI agent.
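As a sketch of what such a server-side tool definition might look like, here is an availability-check webhook in the spirit of the cal.com example. The field names only approximate the tool configuration, and the URL is just a placeholder.

```python
# Hedged sketch of a server-side (webhook) tool like the availability lookup
# mentioned above. Field names approximate the tool configuration in the
# dashboard/API and may not match the exact schema.
availability_tool = {
    "type": "webhook",
    "name": "check_availability",
    "description": "Check open appointment slots before offering times to the caller.",
    "api_schema": {
        "url": "https://example.com/api/availability",   # your endpoint or cal.com proxy
        "method": "GET",
        "query_params_schema": {
            "properties": {
                "date": {"type": "string", "description": "Requested date, YYYY-MM-DD"},
            },
            "required": ["date"],
        },
    },
}
```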
Okay, one more question. What would you suggest as a good setup for a conversational agent with really low latency? Obviously the model plays a big role, and price and latency are the two biggest factors when you want to build an outbound or inbound dialing agent.
What would you suggest for, let's say, an outbound dialing agent, or even an inbound one, to decrease the latency and make it feel more conversational? Because if you use a really good model, it's really large, the latency is just super high, and that's not a meaningful conversation to have.
Gotcha, that's a good question. If you go to the documentation, within the conversational AI section, we have some best practices. I'm not sure if we have specific... (I usually do that by testing, but I want to skip the testing part.)
Yeah. Do we have specific guidance on the best model? It kind of depends on your use case, right? He's asking about the best LLM to use. Right, yeah, I think it depends on your use case, depending on the function calling and how much of that you have.
You can probably go down to a Gemini Flash or Flash-Lite to reduce the latency on that side. But on our end, in terms of the voice models, we use the Flash models by default for the speech generation.
So, depending on the languages that you want to support. Yeah. I think testing is the way; I'll be able to figure it out with proper testing with different models, flash on and off, all that good stuff. Thank you so much, man. Cool. Thanks. I have a couple of questions.
So, number one: what does the total cost per minute come down to? You mean by default? It kind of depends what pricing tier you're on. If you go to the pricing page, here, conversational AI: depending on the tier you're on, there's a certain amount of minutes included.
So, currently, the pricing is based on call minutes. And then, depending on which tier you're on, there is additional minutes that are charged at a specific price. So, it does somewhat depend on kind of the pricing tier that you're on. If you want to have like an application where you need like really long interaction times.
Say you want to build a companion that will talk to you while you cook. Is there something you can do to mitigate that cost, the time-based cost? Yeah, that's a good question. I think for those use cases it might not be great at the moment, because if people are running this 24/7 to be able to talk to someone, the cost is pretty significant.
That's a good question. If you reach out to the sales team, there is some custom pricing we can do based on the use case. But for now, this is charged based on minutes used, that is, the time the session is live.
Okay. And final question. I tried to do an agent in your dashboard and it had several tasks. So, it was one onboarding task, then another follow up task. And it kind of got confused. Like it was not like identifying when to do one task, when to do the other.
So, is there a way to mitigate this, or maybe have a multi-agent configuration? Yeah. So we have what we call agent-to-agent transfers. One way to do this is to set that up; this is a system tool as well, where you set up different agents for different use cases.
For certain use cases you might also want to use a different LLM to power that use case, depending on which LLM is best for the task. So you can configure different agents for different tasks, and then an orchestration setup that will route to a different agent in the background.
If you keep the voice the same, this happens somewhat silently, without the user actually knowing that they're being transferred. It's an immediate transfer. It just means that you can develop these agents potentially also across teams, where you have one team that owns a specific agent.
And then in the background you just switch between the agents for the different tasks. Great. Thank you. Thanks. Cool. Any more questions? Yeah, latency. So, in general, simple examples are usually great for demos, but when you have something heavier, more enterprise-grade, you have more data and your RAG is taking time to come back.
How do you not have a shitty experience because, you know, think of it from the end user's perspective, right? They're like, ah, it was like, then blank for the next minute or two. Do you suggest fillers? Like, you know, does the conversational AI say, I'm thinking, let me think about it?
Or how do you kind of make it more natural? Because there is going to be like an enterprise setup, right? Like if I have to then go look up a patient's claim, you know, that might go hit a database, once that comes back, it goes to other systems, does a whole bunch of things.
But it takes time. Yeah. So how do you set it up when there's latency that can't be avoided? How do we make the conversational experience better? Yeah, so there are certain things you can do. If you have a big knowledge base, for example, using RAG is one way of mitigating that taking too much time.
And then also, in terms of your tools, when you define your tools you can configure, where is it, the response time. So when you add parameters... where was the configuration? I think there's a configuration for, yeah, the timeout: basically how long you want to wait for this tool to come back.
And also if you want to wait sort of for the response. So, I think the maximum timeout we allow is like 120 seconds. And then the agent will actually like say, oh, I'm currently looking that up in the system. Sorry, you know, we're still waiting kind of on the response.
So, that is sort of built into the tooling here. Now, depending on your use case, you probably want to set that fairly low, because the caller is waiting on the line. I wonder if you can do something where it says, oh, I'll call you back, and then takes the action later. But I think for now, depending on your timeout setting, it will just wait for the response.
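To tie that to the tool configuration, here is the same kind of webhook-tool sketch with a timeout set on it. The exact field name for the timeout is an assumption on my part, so check the tool configuration docs for the real one.

```python
# Same webhook-tool idea as before, but with the timeout behaviour discussed
# here. The exact field name (e.g. response_timeout_secs) is an assumption;
# check the tool configuration docs.
claim_lookup_tool = {
    "type": "webhook",
    "name": "lookup_claim",
    "description": "Fetch a patient's claim status from the backend.",
    "response_timeout_secs": 20,     # keep this low; the caller is waiting live
    "api_schema": {
        "url": "https://example.com/api/claims/status",   # placeholder endpoint
        "method": "GET",
    },
}
```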
And it will stay conversational, talking the user through it: oh, we're still waiting on the tool response there. And are all conversations linear? Or can they branch off and come back? Like, while it's looking up something, can it come back five minutes later: oh, you know, I found it, and in the meantime get other details from the customer or patient or whoever else? That's a good question. I don't think so. I think currently, in that case, you would orchestrate it in a way where you put it into a queue.
The only thing then is: how do you update the agent? I would need to look this up, but there might be a way, with the WebSockets, where as information comes back you can inject that additional information into the conversation through a WebSocket notification.
But I would need to look up that specific use case. What you could do is put these tasks into a queue, work through them in the background, and as the responses come back, inject them back into the context through WebSocket events, potentially, but I would need to double check if that's feasible.
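As a purely speculative sketch of that queue-plus-WebSocket idea, and assuming a contextual-update style client event exists on the conversation WebSocket (which, as said above, would need to be verified against the docs), a background worker might push results back like this:

```python
# Pure illustration of the queue-plus-WebSocket idea above. This ASSUMES a
# "contextual update" style client event exists on the conversation
# WebSocket; as noted in the talk, that needs to be verified in the docs.
import json
import websocket  # pip install websocket-client

def push_tool_result(ws: websocket.WebSocket, text: str) -> None:
    # Inject extra context into the running conversation once a slow
    # background job (e.g. a claims lookup) finishes.
    ws.send(json.dumps({
        "type": "contextual_update",   # assumed event name
        "text": text,
    }))

# Usage (assuming `ws` is an already-open conversation WebSocket):
# push_tool_result(ws, "Claim 1234 was approved on 2024-05-01.")
```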
Okay, that's all I have. Thanks, man. Cool. Yeah, thanks. Sweet. Is everyone, while we're like going through the questions, is everyone setting up their agents? Are we? No? Okay, good. Yeah, I see someone nodding there. Fantastic. Okay, cheers. Yeah. Hello. Could you explain or demonstrate what happens if you have multiple languages within an input and output, and how that might work if you have different accents for each language?
Yeah, so... you mean if the person is mixing different languages? Yeah, for instance, if they say, "Explain the difference between Schadenfreude and 性在热火" in Chinese. Yeah, it's a good question. I haven't tested that specifically. Maybe you can try right now?
We can, yeah. So what was it? Explain the difference between Schadenfreude and 性在热火。 Okay. Great. Yeah. Okay. Hi there. This is the API Day Singapore conversational AI agent. How can I help you today? Can you please explain the difference between Schadenfreude and 性在热火? Schadenfreude is a German term that means the pleasure or satisfaction one feels from witnessing another person's misfortune or failure.
It is a well-known concept and is often used in English as well. However, 性在热火 does not appear to be a recognized term in German or in psychology, and there is no widely accepted meaning for it. It's possible there may be a spelling error or confusion with another term. Could you clarify what you mean by 性在热火 or provide more context?
That way, I can give you a more accurate comparison. It might be my pronunciation, but it's meant to be a Chinese term. Thank you for clarifying that you meant a Chinese term. Unfortunately, 性在热火 does not sound like a recognizable phrase in Mandarin or other major Chinese dialects, at least based on that pronunciation.
If you can provide the Chinese characters, a different spelling, or more context about what the term describes or when it is used, I may be able to help you identify it. Yeah, it might be my pronunciation. But the problem is, if you mix languages... well, English and German at least are both Germanic languages, right?
And Mandarin isn't, so I think there might be bigger challenges there. And the more languages you mix, the more challenging it gets. Yeah. Do you have a recommendation, then, if you want to build a language-learning application, for example? Yeah, I wonder if there are certain things you can do with the system prompt in terms of improving how it's being picked up.
But because we're going through text here, the language-learning use case is a bit more challenging, especially if you're mixing Germanic languages with... yeah, it's a good question, I don't have an immediate answer for you there.
Actually, you might want to try a sound-token-to-sound-token model like OpenAI Realtime. I wonder if in that case it does better, because it doesn't go through text. Yeah, it does. Sometimes it switches the accents too, which is kind of annoying, because it will try to pronounce Chinese, for instance, in an English accent.
Interesting. Yeah, so there are challenges with that, but that's a good one. I'll take that back and see how we... so I know we have a customer in India, Supernova, that does this, but it's specifically English learning for the Indian market, so I think it's a bit of a different use case. Do you know if it produces English in a particular accent, or is it using Indian phonetic sounds? I think there's a case study, "ElevenLabs Supernova", so maybe you can look that up.
There's a video, so maybe take a look at that. Yeah, I'll look it up, thank you so much. And then, if you connect with us, we can also follow up with some guidance on that use case specifically.
Perfect. Thank you. Cool. Thanks. All right. I was curious whether you're worried about scammers or fraudsters using these tools. Yeah, so there's definitely a worry with that, as with all of this technology, and it's something that is very important for us. If you go to elevenlabs.io/safety, you can see the safety tools that we're developing in parallel to our features.
There's a bunch of things that we do. Specifically, we do live moderation for certain things. When you publish your voice to the voice library, you can specify terms that you don't want your voice to say, and in that case we have live moderation to make sure your voice isn't used to generate those specific terms or sentences.
We also monitor in general what's being generated on the platform. And beyond moderation, any speech that is generated on our platform is watermarked, to the extent that we can trace back which account generated a specific piece of speech.
So if we identify fraudulent activity, we can trace back which account generated it, ban them, and provide information to the authorities as needed. And then the other thing, in terms of voice cloning: we have the voice captcha that we developed. When you are creating a professional voice clone,
we actually generate a random sentence that you need to read out, to verify that you have permission to clone this voice. So yeah, with all of the technology that we develop, we put quite a large amount of focus and effort into safety tooling.
There is obviously always a concern that your technology is being used for fraudulent activity, but so far we've been trying to mitigate that with the safety tooling. Definitely. Yeah, it looks like a lot of good guardrails in place. Thanks.
Thanks. Yay, you're back. You want to go? So, just building on her question: in our setting, if a patient is asked, you know, how are you feeling about this, they might try to speak in English, hold part of the conversation in English, and then jump to Spanish or Portuguese and come back to English, for example when they have to describe something they can't in English.
They jump back to the language they're most comfortable in; sometimes they jump around between different languages. The way you explained it, you have some kind of router that checks what language it is and then routes it.
But within a conversation they jump between languages: they'll explain a few things and then throw in a few words in another language, and so on. So for you, this is English, Spanish, and Portuguese all mixed together. Yes.
Sometimes like if you just ask them how you're feeling. Okay. Then they come back. Hey, you know, there's this, you know, I took this medication, it hurts, you know, and then if you say, okay, where's it hurting? How? Then they suddenly kind of, you know, they regress to whatever language is most comfortable to them to explain their thing.
Gotcha. Yeah. I mean, you can see here that the transcript actually correctly identified Schadenfreude, because technically it's also an English word, right? But on the Chinese word it just completely... well, you can partly blame my pronunciation.
Probably you can blame it a lot. But yeah, I wonder about a native speaker. I don't have exact benchmarks on how much the transcript gets worse the more languages you introduce in the same conversation. I think generally, if you have two languages intermixed, it tends to perform okay.
But once you have three different languages, it progressively tends to get worse. I don't have exact benchmarks on how many languages, though. Okay, cool. No, that's good to know. That's what I came up with.
But yeah, it would be worthwhile, if you have recordings of some of that, to put them through our transcription model and see how it performs at identifying it. That would be interesting. Yeah. Cool. Thank you. I saved this question for the end because it's kind of unrelated.
So, I worked on a project where we used ElevenLabs for the voice track of our avatar. Okay. ElevenLabs worked well, but we had a lot more downstream issues in terms of lip sync and, I think someone mentioned, lags and timing and other things.
So is there any plan for ElevenLabs to come, I guess, further down the stack in terms of avatars, or is that even something you're thinking about? Interesting. So did you build the lip-syncing model and the avatar stuff yourselves? No.
We're working within the NVIDIA Tokkio stack. They have the stack and their Riva voice model, and we swapped that out for ElevenLabs. So they have the full avatar stack, and ElevenLabs is just the voice portion of it.
Yeah. Ah, okay. And sorry, which stack was that? The NVIDIA one? NVIDIA Tokkio. Yeah, we did everything with the GPU, but they have the visualization, and you can just plug in your voice model. It's Tokkio, T-O-K-K-I-O.
T-O-K-K... ah, I see, I was thinking of Tokyo in Japan. Is it this one? Yeah. Oh, interesting. Okay. Yeah, I personally don't have experience with that one. I know that we're mostly working with partners like Hedra and HeyGen for the avatar side of things.
I don't know, Paul, do you know anything there? No. So this is something I'd need to come back to you on and look into. It's interesting. So you're saying, out of the box, it uses an NVIDIA model for speech generation? Yeah. But our client, which I assume is one of your partners...
We can talk about that. But our client couldn't use the NVIDIA model; they had a contract to use the ElevenLabs model. Okay, interesting. Yeah, sorry, I don't have a good answer for you there right now, but this is interesting. We can go back to the team and see if there are any resources we can give you in terms of improving that.
Interesting use case. Thank you. All right, last question. Last question, nice. Hi Thorsten, thanks for the presentation. Thank you. I have a question regarding the transcription model, around adding custom vocabulary. At the company I work for, we use a lot of three-letter acronyms, and, let's say, I want the model to read out SAP as "S-A-P" and not as the word "sap".
Is there a way to tell it to do that? And is there a way to tell it to read words a certain way, and to nudge the interpretation of what I say towards certain words that we use in our vocabulary? Interesting. Yes.
So you have this use case on both sides: on the speech-to-text side you need to correctly identify the acronyms, but then you also need the agent to reply back with the correct pronunciation. For the reply side, have you seen the pronunciation dictionaries?
We have a way for you to provide pronunciation dictionaries with phoneme alphabets, basically the "tomayto, tomahto" case, so you can make sure the text to speech pronounces the acronyms and words the way you want them to.
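As a hedged sketch of that for the SAP example: a small W3C PLS lexicon with an alias rule so "SAP" is read out letter by letter, uploaded as a pronunciation dictionary. The SDK method and parameter names are as I recall them and may differ, so check the pronunciation dictionary docs before relying on this.

```python
# Hedged sketch of the pronunciation-dictionary approach for the "SAP" case:
# a small W3C PLS lexicon with an alias rule, uploaded via the Python SDK.
# Method and parameter names are as I recall them and may differ.
from io import BytesIO
from elevenlabs.client import ElevenLabs

PLS_LEXICON = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>SAP</grapheme>
    <alias>S A P</alias>   <!-- read as letters, not as the word "sap" -->
  </lexeme>
</lexicon>
"""

client = ElevenLabs(api_key="ELEVENLABS_API_KEY")

dictionary = client.pronunciation_dictionary.add_from_file(
    file=BytesIO(PLS_LEXICON.encode("utf-8")),
    name="acronyms",
)
# Reference the returned dictionary id/version from your text-to-speech call
# or agent settings so the rules are applied to generated speech.
```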
Now, the other side, the speech to text: that's an interesting case. I don't think we have a way to fine-tune that specifically for different acronyms. That's a good question. Do you know anything there? No, right? Oh, interesting. So, to a certain extent, you can do it through the system prompt, where you put in a kind of normalization layer to identify things in the transcript that are acronyms and then have the LLM massage that into what you want.
I think that's what you were saying, right? Yeah. So it could be interesting to see if that works well. Have you tried it out already? Well, where I was coming from is that the company has its own custom chatbot called Joule, like the unit of work.
But whenever I read transcripts, it oftentimes comes out as "jewel", as in the gem. Gotcha. And so that's kind of a struggle I'm facing. Okay, is this actually at SAP? Nice. I'm an SAP child myself; my father was there early as well. Okay, anyway, too much information.
Cool. Yeah, thanks for that. Well, we can chat some more and see if that's something we can get going. Sweet. And with that, we're at time. Thanks again, thanks so much for joining. Please do connect, find the resources, and fill in the form for the credits.
I'll leave this up in case you haven't had a chance to scan it. But yeah, thanks so much for joining. Enjoy the conference. And we will also have a booth at the expo, so if you come up with more questions, you can come find us there.
Thank you. Dankeschön, thank you. We'll see you next time.