Hey everyone, I'm Brooke, I'm the founder of Koval, and we're building evals for voice agents. Today I'm going to be talking a little bit about what we can learn from self-driving when building evals for voice agents. My background is from Waymo, where I led our eval job infrastructure team, which was responsible for our developer tools for launching and running simulations.
And now we're taking a lot of the learnings from self-driving and robotics and applying them to voice agents. But first, why are voice agents not everywhere? They have this massive promise of being able to automate all of these really critical, hard workflows autonomously. And I think probably a lot of you are building in voice agents and know how amazing voice agents can be.
I think the biggest problem to launching voice agents is trust. We're simultaneously, and paradoxically, overestimating voice agents, trying to automate all of our call volume or all of our workflows with voice all at once, and underestimating what they're capable of today when you scope a smaller problem well, versus what they could possibly do in the next six months.
I think there's just so much more that you could have for a magical voice experience. So conversational agents are so capable, but you also know that scaling to production is really hard. So it's easy to nail it for 10 conversations, but to do it for 10,000 or 100,000 becomes really difficult.
So a lot of times voice agents get stuck in POC hell, where enterprises are scared to actually deploy them to customer-facing or non-internal workflows. I think there are two approaches to dealing with that today. The first is conservative but deterministic: you force the agent down a specific path and try to get it to do exactly what you want it to do.
This is basically an expensive IVR tree. You're using LLMs, but you're essentially forcing it into certain pathways. Or you can make them much more autonomous and flexible to new scenarios that they've never seen before. But this makes it really hard to scale it to production because they're so unpredictable.
And so taking actions on a user's behalf, or interfacing with users, can be really unpredictable. I think this is a false choice; you can have reliability and autonomy. So how many of you have taken a Waymo? Yeah, if you haven't and you're from out of town, you definitely should.
It is so magical. So how did it become so magical? It's so reliable and also so smooth, but it's able to navigate all of these interactions that it's never seen before, go down streets it's never seen before. And Waymo is launching to all of these new cities very quickly.
So I'm biased, but I think large-scale simulation has been the huge unlock for self-driving and robotics. Without it, you're stuck where we started, with more manual evals: running the car through all the streets, noting where it doesn't behave well, and bringing that back to the engineers.
This obviously is very hard to scale, so then we created these specific tests where we're saying for this specific scenario, we expect these things to happen. But this is very brittle. Your scenarios tend to no longer be useful after a very short period of time, and they're very expensive to maintain, because you have to build up these very complicated scenarios, and then say exactly what should happen in those.
So then the industry as a whole has moved to large-scale evaluation. So how often is a certain type of event happening across many, many simulations? And so instead of trying to say for this specific instance, I want this to happen, you run large-scale simulation to really reliably show how the agent is performing.
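To make "how often does a certain type of event happen" concrete, here's a minimal sketch, not Waymo's or Koval's actual tooling: it aggregates event frequencies across a batch of simulation runs. The event names and the simulation_results structure are hypothetical stand-ins for whatever your simulator emits.

```python
from collections import Counter

# Hypothetical simulation output: each run yields a list of tagged events.
simulation_results = [
    {"run_id": 1, "events": ["hard_brake", "goal_reached"]},
    {"run_id": 2, "events": ["goal_reached"]},
    {"run_id": 3, "events": ["hard_brake", "lane_violation", "goal_reached"]},
]

def event_rates(results):
    """Return the fraction of runs in which each event type occurred."""
    counts = Counter()
    for run in results:
        for event in set(run["events"]):  # count each event type once per run
            counts[event] += 1
    total = len(results)
    return {event: n / total for event, n in counts.items()}

print(event_rates(simulation_results))
# roughly: hard_brake ~0.67, goal_reached 1.0, lane_violation ~0.33
```

The point is that the unit of analysis is the rate across many runs, not a pass/fail verdict on one hand-built scenario.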
So I'm going to talk through some of the things that I've learned from self-driving and how they apply to voice. And hopefully that's useful because it's definitely been useful for us as we interact with hundreds of voice systems. So what is the similarity between the two? Self-driving and conversational evals are very similar because both of them are systems where you're interacting with the real world, and for each step that you take, you have to respond to the environment and go back and forth.
And so simulations are really important for this because for every step that I take in a Waymo or a self-driving car or a smaller robotics device, or in a conversation, when I say, "Hello, what's your name?" you will respond differently than when I say, "Hello, what's your email?" So being able to simulate all of these possible scenarios is really important because otherwise you have to create these static tests or do it manually, and both of which are either expensive or brittle.
And so this doesn't make for very durable tests: if you have to specifically outline every single step along the way, those tests break immediately and are very expensive to maintain. And then lastly, coverage. You really want to simulate all of the possible scenarios across a very large area. The non-determinism of LLMs is actually really useful here, because you can generate all of the possible things someone might say back, simulate that over and over, and look at the probability of your agent succeeding.
Another thing that I touched on a bit is input-output evals versus probabilistic evals. With LLMs to date, we have seen that you run a set of inputs through your prompt, and then you look at all the outputs and evaluate whether or not the output for each input was correct based on some criteria.
So you might have a golden dataset that you're iterating against. With conversational evals, it becomes even more important to have reference-free evaluation, where you don't need to enumerate everything you expect for this exact input. Rather, you're defining, as a whole: how often is my agent resolving the user's inquiry?
How often is my agent repeating itself over and over? How often is my agent saying things it shouldn't? Rather than saying, for this specific scenario these six things should happen, you come up with metrics that apply to lots of scenarios. This is what's going to let you scale your evals, and it's what we did at Waymo.
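As one illustration of what a reference-free metric can look like, here's a tiny sketch that flags an agent repeating itself without needing any golden transcript; the conversation format and the 0.9 similarity threshold are assumptions for the example, not a prescribed schema.

```python
from difflib import SequenceMatcher

def repetition_rate(agent_turns, similarity_threshold=0.9):
    """Fraction of consecutive agent turns that are near-duplicates of the previous turn."""
    if len(agent_turns) < 2:
        return 0.0
    repeats = 0
    for prev, curr in zip(agent_turns, agent_turns[1:]):
        if SequenceMatcher(None, prev.lower(), curr.lower()).ratio() >= similarity_threshold:
            repeats += 1
    return repeats / (len(agent_turns) - 1)

# Hypothetical agent turns extracted from one call.
turns = [
    "I can help you book an appointment. What day works for you?",
    "I can help you book an appointment. What day works for you?",
    "Great, I've booked you for Tuesday at 3pm.",
]
print(repetition_rate(turns))  # 0.5: the agent repeated itself in one of two transitions
```

A metric like "how often does the agent resolve the inquiry" would typically be an LLM-as-judge score instead, but the shape is the same: it applies to any conversation, with no expected output attached.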
Another thing is that constant eval loops are what made autonomous vehicles scalable, and that's what's going to make voice agents scalable. We're seeing today that voice agents are so expensive to maintain in production that, once you deploy to an enterprise, it often turns into a professional service if you don't set up your processes right.
And so you're constantly making all of these tweaks for specific enterprises, which can take up 80% of your time even after you've set up the initial agent. Here's how the autonomous vehicle industry handles it: let's say you find a bug. As an engineer, I run a couple of evals to reproduce it, I iterate, I fix the issue, and then I run more.
Say the car wasn't stopping at a stop sign. I iterate on that, and now it is stopping at the stop sign. But then I run a larger regression set, because maybe I just made the car stop every 10 seconds and broke everything else. The regression set makes sure I didn't.
And then we have a set of pre-submit and post-submit CI/CD workflows, so that before you ship code, and after you push the code to production, we make sure everything is continuously working. And then there are large-scale release evals, making sure everything is up to par before we launch a new release.
And this might be both manual evals and automated evals, and some combination thereof. And then live monitoring and detection, which you can feed back into this whole system. We're emulating a lot of this with voice; we think that's the right approach as well.
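As a sketch of what an automated pre-submit gate could look like, assuming you already have some harness for running simulated conversations (run_simulated_conversation below is a placeholder for it), you run a regression set and fail the build if the pass rate drops below your bar.

```python
import random
import sys

REGRESSION_SCENARIOS = ["book_appointment", "cancel_appointment", "ask_for_human"]
MIN_PASS_RATE = 0.95  # release bar; pick a number that matches your reliability target

def run_simulated_conversation(scenario: str) -> bool:
    """Placeholder for your simulation harness: True means the scenario's evals passed."""
    return random.random() > 0.02  # stand-in outcome so the sketch runs

def regression_gate(runs_per_scenario: int = 20) -> int:
    results = []
    for scenario in REGRESSION_SCENARIOS:
        results += [run_simulated_conversation(scenario) for _ in range(runs_per_scenario)]
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.2%} over {len(results)} simulated conversations")
    return 0 if pass_rate >= MIN_PASS_RATE else 1  # nonzero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(regression_gate())
```

Wired into CI, the same script can run pre-submit on a small set and post-submit or pre-release on a much larger one.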
But notably, there are still manual evals involved. The goal is not to automate all evals, but rather to leverage auto evals for speed and scale, and then use the manual time that you have to focus on the judgment calls that really need a human touch. So the process that we've seen is: you might start with simulated conversations, and you run some happy paths, like, I know I should be able to book an appointment.
So: book an appointment for tomorrow. I run a bunch of simulations of that, I run evals, and I come up with metrics by looking through all of those conversations. Looking at your data is super important. I look at all those conversations and I say, these are the ways they're failing.
So I set up some automated metrics and iterate through this loop several times. Now I think it's ready for production, so I ship it to production and then run those evals again. This virtuous cycle of iterating through simulation, flagging things for human review, and feeding all of that back into your simulations is super important for scalable voice agents.
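One hypothetical way to sketch the "flag for human review, feed back into simulation" step: score production calls with your automated metrics, queue the low scorers for a human, and turn confirmed failures into seeds for new simulated scenarios. The score field, threshold, and data shapes here are illustrative only.

```python
# Hypothetical production logs: each call has an automated metric score attached.
production_calls = [
    {"call_id": "a1", "resolution_score": 0.95, "transcript": "..."},
    {"call_id": "b2", "resolution_score": 0.40, "transcript": "..."},
    {"call_id": "c3", "resolution_score": 0.88, "transcript": "..."},
]

REVIEW_THRESHOLD = 0.6  # calls scoring below this go to a human

def triage(calls):
    """Split calls into a human-review queue and candidate seeds for new simulations."""
    review_queue = [c for c in calls if c["resolution_score"] < REVIEW_THRESHOLD]
    # Once a human confirms the failure, the call becomes a seed for new simulated scenarios.
    new_scenarios = [{"seed_call": c["call_id"], "transcript": c["transcript"]} for c in review_queue]
    return review_queue, new_scenarios

queue, scenarios = triage(production_calls)
print(f"{len(queue)} calls flagged for review, {len(scenarios)} new simulation seeds")
```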
So what level of realism is actually needed? A question we get a lot is: do your simulated users sound exactly like my customers? And that's a good question, because the level of realism depends on what you're trying to test. Like any scientific method, you're trying to control variables and then test for the things that you care about.
Something we saw in self-driving is that there's a hierarchy here: you might not need to simulate everything in order to get representative feedback on how your system is doing. For example, all the time there are these super hyper-realistic simulations coming out that look like a real video.
And people will say that simulation system is amazing. That's not necessarily true, because what you want from a simulation system is control: which parts of the system are you simulating, and what inputs are needed? You might just need to know that this is a dog, this is a cat, and this is a person walking across the street.
And then: what should I do next as a result of those inputs? It's the same for voice. For workflows, tool calls, and instruction following, you often don't even need to simulate with voice. You might want end-to-end tests with voice, but when you're iterating, doing all of that with text is probably the fastest and cheapest way.
Then for interruptions, latency, or instructed pauses, simple, basic voices are sufficient, because you're doing voice-to-voice testing but accents or background noise might not impact those things as much. And where you need hyper-realistic voices, with different accents, different background noise, different audio quality, et cetera, is when you're testing those things in production and trying to recreate those issues.
So thinking about the base level of components you really need is super important for building a good eval strategy. And then an awesome tactic that we've learned is denoising. You run a bunch of evals, and you might find one that failed.
Something that's really important about agents is that it's not necessarily the end of the world if it fails one time. You really want to know the overall probability of it failing. So you find that scenario and re-simulate it, maybe a hundred times.
Is this scenario failing 50 out of a hundred times, a coin flip? Is it failing 99 out of a hundred times, which means it's essentially always failing? Or does it fail once out of a hundred times, which might be totally okay for your application? In the same way that cloud infrastructure asks whether you're shooting for six nines of reliability, for voice AI it's really important to know what reliability you're looking for in different parts of your product.
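A minimal sketch of denoising, assuming run_scenario is a placeholder for re-running one simulated conversation: re-simulate the failing scenario many times and look at the failure rate instead of reacting to a single bad run.

```python
import random

def run_scenario(scenario_id: str) -> bool:
    """Placeholder for re-running one simulated conversation; True means the evals passed."""
    return random.random() > 0.1  # pretend this scenario fails roughly 10% of the time

def denoise(scenario_id: str, n_runs: int = 100) -> float:
    failures = sum(not run_scenario(scenario_id) for _ in range(n_runs))
    failure_rate = failures / n_runs
    print(f"{scenario_id}: failed {failures}/{n_runs} runs ({failure_rate:.0%})")
    return failure_rate

# Coin flip, near-certain failure, or one-in-a-hundred blip? The rate tells you which.
denoise("book_appointment_with_interruption")
```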
So now I want to talk a bit about how to build an eval strategy, because we believe evals are a key part of your product development, not just an engineering best practice.
This is actually a core part of thinking through what your product does. For voice AI, thinking through what metrics you should use is thinking through what your product does and what you want it to be good at. You can build a general voice model that's kind of good at everything, and that already exists, right?
You can use the OpenAI APIs, you can use all of these different end-to-end voice systems that already exist and are generally useful. But really, you're probably building a vertical agent that you're trying to make useful at something in particular. So thinking about what you want it to do well, and what you don't care whether it does, is a really important part of the process.
And it's not just about latency; it's about interruptions, workflows, instruction following. Your voice application actually might not be so latency sensitive, because someone who really wants a refund will wait. But if you're doing outbound sales, latency is super important, because that person is about to hang up the phone. And how closely you adhere to instruction following is really important for some applications.
If you're booking an appointment and don't get all the details, it's useless. But if you're an interviewer or a therapist, you might be tuned more for conversational workflows. So really think through what you're trying to measure; these are the five things that we see the most.
LLM as a judge is a really powerful way of being flexible, and you can build out evals that are very specific to different customers. But something we get asked a lot is: how do you trust LLM as a judge? It's this magical thing that can be flexible to so many cases.
It can also be very noisy. A common pattern we see with LLM as a judge is asking, "Was this conversation successful?" That's going to be a really noisy metric: you might run it 10 times on the same conversation and get lots of different answers.
So in Koval, we have this metric studio that we think is pretty different from anything out there, because it allows you to incorporate human feedback, calibrating your metrics against it. You can iterate over and over until your automated metrics align with human feedback.
And now you have the confidence to deploy those metrics in production and run them over 10,000 conversations, instead of the 100, or the 10, that you labeled to get that confidence. So putting in the time to say, this is the level of reliability we're looking for, and being thoughtful about it, really matters. Maybe labeling just 10 conversations is enough for you, or maybe you really want to dial this in; either way, this workflow can be really powerful.
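As a rough sketch of what calibrating against human feedback can mean in practice (this is not a description of the metric studio's internals), you compare the judge's labels to a small human-labeled set, measure agreement, and re-read the disagreements before the next prompt iteration. The labels below are made up.

```python
# Hypothetical labels for the same 10 conversations: human review vs. the LLM judge.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = conversation judged successful
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

def agreement(human, judge):
    """Fraction of conversations where the automated judge matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def disagreements(human, judge):
    """Indices worth re-reading; these drive the next iteration of the judge prompt."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]

print(f"agreement: {agreement(human_labels, judge_labels):.0%}")
print(f"re-review conversations: {disagreements(human_labels, judge_labels)}")
# Iterate on the judge prompt until agreement is high enough to trust it over 10,000 calls.
```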
Our other advice for how to approach evals for voice AI is to start by reviewing public benchmarks, which can be a rough dial for the general direction you want to go in.
Then benchmark with your own specific data. If you're a medical company, use the medical terms you're going to see in production to test out different transcription methods, et cetera. Then run task-based evals, which are maybe text-only, or very specific, smaller modules of your system. Again, as I talked about with self-driving, you don't necessarily need to enable every module on the car in order to test the one thing you're trying to test.
And then end-to-end evals, where you're running everything at scale, the way it would run in production. So we've done a lot of benchmarking; you should check out the benchmarks on our website. We try to do continuous benchmarking of the latest models out there. But doing your own custom benchmarking is also really important.
Through Koval you can do this, and you can also do it yourself; I just happen to have a tool that does it. You can run on your specific data, because you might prefer different voices for the types of conversations you're having, or different LLMs based on your specific tasks.
And so benchmarking each part of your voice stack is really helpful for choosing those models, especially because voice involves so many models.
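Here's a minimal sketch of benchmarking one slice of the stack, transcription on your own domain terms, by computing word error rate against reference phrases. The model names and their outputs are made up; a real harness would call the actual STT providers you're comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Domain-specific reference phrases, e.g. medical terms you actually hear in production.
references = ["the patient is on metformin and lisinopril"]
# Hypothetical outputs from two transcription models you're comparing.
candidates = {
    "stt_model_a": ["the patient is on met forming and lisinopril"],
    "stt_model_b": ["the patient is on metformin and lisinopril"],
}

for model, outputs in candidates.items():
    wers = [word_error_rate(r, h) for r, h in zip(references, outputs)]
    print(f"{model}: average WER {sum(wers) / len(wers):.2f}")
```

The same pattern works for any other component: hold the inputs fixed to your own data and swap the model under test.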
Then build out your task evals, so you start to get a sense of baseline performance. Where are the problem areas in your voice application? Where are things working? Where could they be better? And then create an eval process. That means: what kinds of continuous monitoring are we doing? What do we do when we find a bug in production from a customer? Who takes it? And what test sets does it go into so we can make sure it doesn't happen again?
How do we set up our hierarchy of test sets? Do we have test sets for specific customers we care a lot about, for the types of customers we have, or for specific workflows and features of our voice agent? And then create dashboards and processes so that you can check in on those things continuously.
I think this is an underestimated piece of the process: what is our continuous eval process, versus just asking, does the voice agent work when I deploy it to this customer during their pilot period? So, yeah, I'm always happy to talk more about tips from what we've seen across all of the many voice systems we've worked with.
But one of the reasons we're so excited about the future of voice, and I think Quinn stole a little bit of this, is that I really think voice is the next platform. We had web, we had mobile, and both of these were huge platform shifts in what types of things you expect companies to let you do on those platforms.
What types of work, where in the workflow, where in your daily life are you meeting the user? I think voice is unlocking all of these new, really natural experiences. It doesn't mean everything should be done via voice, but there's really exciting potential there. And in the next three years, we think every enterprise is going to launch a voice experience.
It's going to be like mobile apps: if an airline doesn't have a good voice experience, it will be like not having a good mobile app, and having one will just be a baseline expectation. And I think users' expectations of what a really amazing, magical voice AI experience looks like are only going to increase over the next few years.
So we really want to enable this future, and we think the next generation of scalable voice AI will be built with integrated evals, using Koval. And we're hiring; we're always looking for people to join us. I think this is one of the most technically interesting fields I have ever worked in, because you get to work with every model across the stack, and there are so many different types of models, different types of problems, scalability challenges, new frontiers of building infrastructure, and no one knows any of the answers yet.
So it's a really exciting space. Thanks so much, everyone.