
From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval



00:00:02.000 | - Hey everyone, I'm Brooke, I'm the founder of Coval,
00:00:18.400 | and we are building evals for voice agents.
00:00:20.920 | So today I'm gonna be talking a little bit about
00:00:23.200 | what can we learn from self-driving
00:00:24.840 | in building evals for voice agents.
00:00:27.160 | My background is from Waymo,
00:00:28.360 | so I led our eval job infrastructure team at Waymo
00:00:31.480 | that was responsible for our developer tools
00:00:33.560 | for launching and running simulations.
00:00:35.480 | And now we're taking a lot of the learnings
00:00:37.160 | from self-driving and robotics
00:00:38.500 | and applying them to voice agents.
00:00:40.560 | But first, why are voice agents not everywhere?
00:00:44.220 | They have this massive promise of being able to automate
00:00:47.040 | all of these really critical, hard workflows autonomously.
00:00:51.060 | And I think probably a lot of you are building voice agents
00:00:53.940 | and know how amazing voice agents can be.
00:00:56.960 | I think the biggest problem to launching voice agents is trust.
00:01:00.080 | So we simultaneously are paradoxically overestimating voice agents
00:01:06.160 | and trying to say, I'm going to automate all of my call volume
00:01:09.360 | or all of my workflows with voice all at once.
00:01:12.880 | And we're underestimating them,
00:01:15.200 | I think, in what they're capable of today,
00:01:17.200 | by not really scoping a smaller problem,
00:01:19.200 | versus what they could possibly do in the next six months.
00:01:22.200 | I think there's just so much more that you could have
00:01:25.560 | for a magical voice experience.
00:01:27.160 | So conversational agents are so capable,
00:01:31.480 | but you also know that scaling to production is really hard.
00:01:34.480 | So it's easy to nail it for 10 conversations,
00:01:36.320 | but to do it for 10,000 or 100,000 becomes really difficult.
00:01:40.280 | So a lot of times voice agents get stuck in POC hell
00:01:43.400 | where enterprises are scared to actually deploy them
00:01:46.760 | to customer-facing use cases or non-internal workflows.
00:01:50.200 | So I think there are two approaches to deal with that today.
00:01:54.360 | There's the conservative but deterministic approach.
00:01:56.920 | So you force the agent down the specific path
00:01:59.720 | and try and get it to do something exactly as you want it to do.
00:02:03.080 | This is basically an expensive IVR tree.
00:02:05.160 | You're using LLMs,
00:02:07.240 | but you're essentially forcing it into certain pathways.
00:02:09.800 | Or you can make them much more autonomous
00:02:12.440 | and flexible to new scenarios that they've never seen before.
00:02:15.640 | But this makes it really hard to scale it to production
00:02:18.760 | because they're so unpredictable.
00:02:20.280 | And so taking actions on behalf of users, or interfacing with them,
00:02:23.800 | can be really unpredictable.
00:02:26.120 | I think this is a false choice.
00:02:29.000 | I think you can have reliability and autonomy.
00:02:31.560 | So how many of you have taken the Waymo?
00:02:34.040 | Yeah, if you haven't and you're from out of town,
00:02:37.320 | you definitely should.
00:02:38.120 | It is so magical.
00:02:41.160 | So how did it become so magical?
00:02:43.240 | It's so reliable and also so smooth,
00:02:46.200 | but it's able to navigate all of these interactions
00:02:50.440 | that it's never seen before,
00:02:51.640 | go down streets it's never seen before.
00:02:53.480 | And Waymo is launching in all of these new cities
00:02:56.840 | very quickly.
00:02:57.480 | So I'm biased,
00:03:00.920 | but I think large-scale simulation has been the huge unlock
00:03:03.880 | for self-driving and robotics.
00:03:05.800 | Because without it, it's very hard to scale.
00:03:06.760 | So we started with more manual evals,
00:03:09.640 | starting with running the car throughout all the streets,
00:03:13.880 | noting where it doesn't go well,
00:03:15.640 | and then bringing that back to the engineers.
00:03:17.720 | This obviously is very hard to scale,
00:03:19.320 | so then we created these specific tests
00:03:21.480 | where we're saying for this specific scenario,
00:03:23.320 | we expect these things to happen.
00:03:24.760 | But this is very brittle.
00:03:26.920 | Your scenarios tend to no longer be useful after a very short period of time,
00:03:31.800 | and they're very expensive to maintain,
00:03:33.960 | because you have to build up these very complicated scenarios,
00:03:36.600 | and then say exactly what should happen in those.
00:03:38.360 | So then the industry as a whole has moved to large-scale evaluation.
00:03:44.040 | So how often is a certain type of event happening across many, many simulations?
00:03:50.040 | And so instead of trying to say for this specific instance,
00:03:53.160 | I want this to happen, you run large-scale simulation to really reliably show how the agent is performing.
00:03:59.080 | So I'm going to talk through some of the things that I've learned from self-driving
00:04:03.240 | and how they apply to voice.
00:04:04.680 | And hopefully that's useful because it's definitely been useful for us
00:04:08.280 | as we interact with hundreds of voice systems.
00:04:11.480 | So what is the similarity between the two?
00:04:15.080 | Self-driving and conversational evals are very similar
00:04:17.880 | because both of them are systems where you're interacting with the real world,
00:04:21.640 | and for each step that you take, you have to respond to the environment and go back and forth.
00:04:26.280 | And so simulations are really important for this because
00:04:31.480 | for every step that I take in a Waymo or a self-driving car or a smaller robotics device,
00:04:39.720 | or in a conversation, when I say, "Hello, what's your name?"
00:04:42.920 | you will respond differently than when I say, "Hello, what's your email?"
00:04:45.560 | So being able to simulate all of these possible scenarios is really important
00:04:50.520 | because otherwise you have to create these static tests or do it manually,
00:04:54.440 | both of which are either expensive or brittle.
00:04:58.120 | And so this doesn't make for very durable tests.
00:05:02.600 | If you have to specifically outline every single step along the way,
00:05:05.800 | those break immediately and are very expensive to maintain.
00:05:09.080 | And then lastly, coverage.
00:05:11.240 | You really want to simulate all of the possible scenarios across a very large area.
00:05:16.040 | And so the non-determinism of LLMs is actually really useful for this because you can show
00:05:21.400 | what are all of the possible things that someone might respond back to this and I might simulate
00:05:25.240 | that over and over and look for what the probability of my agent succeeding is.
00:05:29.720 | Another thing that I touched on a bit is input-output evals versus probabilistic evals.
00:05:38.360 | So with LLMs to date, we have seen that you run a set of inputs for your prompt,
00:05:43.960 | and then you look at all the outputs and evaluate
00:05:50.200 | whether or not the output for that input was correct based on some criteria.
00:05:53.960 | So you might have a golden data set that you're iterating against.
00:05:57.960 | With conversational evals, it becomes even more important to have reference-free evaluation
00:06:03.000 | where you don't necessarily need to say, "These are all of the things I expect for this exact input."
00:06:10.200 | But rather, you're defining as a whole, "How often is my agent resolving the user inquiry?
00:06:15.560 | How often is my agent repeating itself over and over?
00:06:18.680 | How often is my agent saying things it shouldn't?"
00:06:21.160 | Rather than saying, "For this specific scenario, these six things should happen."
00:06:26.840 | And so this is what's going to really allow you to scale your evals, and also what we did at Waymo.
00:06:31.720 | So coming up with metrics that apply to lots of scenarios.
00:06:35.320 | Another thing is that constant eval loops are what made autonomous vehicles scalable,
00:06:41.000 | and that's what's going to make voice agents scalable.
00:06:43.960 | I think we're seeing today that voice agents are so expensive to maintain in production once you deploy to an enterprise,
00:06:50.120 | it often becomes a professional service if you don't set up your processes right.
00:06:55.480 | And so you're constantly making all of these tweaks for specific enterprises, which can take up 80%
00:07:00.360 | of your time even after you've set up the initial agent.
00:07:03.400 | So something that the autonomous vehicle industry has been doing is this:
00:07:09.000 | let's say you find a bug. As an engineer, I might iterate on that,
00:07:13.560 | and I run a couple of evals to reproduce that, and then I fix the issue, and then I run more.
00:07:17.960 | So it wasn't stopping at a stop sign. I iterate on that, and now it is stopping at a stop sign.
00:07:22.840 | But then I run a larger regression set, because maybe I just made the car stop every 10 seconds.
00:07:28.360 | And so I broke everything. So then you run a larger regression set, and make sure you didn't break anything else.
00:07:33.800 | And then we have a set of pre-submit and post-submit CI/CD workflows.
00:07:38.840 | So that before you ship code, and then after you push the code to production, we make sure everything is continuously working.
00:07:45.720 | And then there's large-scale release evals. So making sure that everything is up to par before we launch a new release.
00:07:52.280 | And this might be both manual evals and automated evals and some combination thereof.
00:07:57.320 | And then live monitoring and detection, which then you can feed back into this whole system.
00:08:03.560 | So we're emulating a lot of this with voice. We think that's the right approach as well.
00:08:11.000 | But notably, there are still manual evals involved. The goal is not to automate all evals,
00:08:18.200 | but rather to leverage auto evals for speed and scale, and then use the manual time that you have
00:08:24.440 | to really focus on those very human-touch judgment calls.
00:08:30.280 | So the process that we've seen is you might start with simulated conversations,
00:08:35.320 | and you run some happy paths, like: I know I should be able to book an appointment.
00:08:38.680 | So book an appointment for tomorrow. I run a bunch of simulations of that. I run evals.
00:08:44.600 | I come up with metrics by looking through all those conversations. Looking at your data is super
00:08:48.520 | important. I look at all those conversations and I say, these are the ways they're failing. So I set
00:08:53.240 | up some automated metrics and iterate through this loop several times. Now I think it's ready for
00:08:58.280 | production. So I ship it to production and then run those evals again. And so this cycle, this virtuous
00:09:04.120 | cycle of iterating through simulation and then detecting, or flagging, things for human review,
00:09:11.080 | and then feeding all of that back into your simulations is super important for scalable voice
00:09:15.560 | agents.
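A minimal sketch of that first step might look like the following: an LLM plays the caller against your agent for a few turns of the happy path, and the resulting transcripts go through your automated evals. The agent_reply() hook and the model choice are assumptions, not a specific vendor's API.

```python
# Sketch of simulating a happy path ("book an appointment for tomorrow"):
# an LLM plays the caller against your agent for a few turns, and you score
# the resulting transcripts. agent_reply() is a hypothetical hook into your
# agent's text layer; the model choice is illustrative.
from openai import OpenAI

client = OpenAI()

def agent_reply(history: list[dict]) -> str:
    """Your voice agent's next turn given the conversation so far (placeholder)."""
    return "Sure, what time tomorrow works for you?"

def caller_reply(history: list[dict], goal: str) -> str:
    """A simulated caller pursuing a goal; roles are from the agent's perspective."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": f"You are a caller. Your goal: {goal}. Reply with one short turn."}]
                 + history,
    )
    return resp.choices[0].message.content

def simulate_conversation(goal: str, max_turns: int = 6) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_turns):
        history.append({"role": "user", "content": caller_reply(history, goal)})
        history.append({"role": "assistant", "content": agent_reply(history)})
    return history

# Run the same happy path many times, then score it with your automated metrics.
transcripts = [simulate_conversation("book an appointment for tomorrow") for _ in range(20)]
```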
00:09:15.960 | So what level of realism is actually needed?
00:09:22.040 | A question we get a lot is, do your voice agents sound exactly like my customers?
00:09:29.480 | And that's a good question because the level of realism is dependent on what you're trying to test.
00:09:35.320 | So like any scientific method, right? You're trying to control variables and then test for
00:09:40.120 | the things that you care about. So something we saw in self-driving is that there's kind of a
00:09:45.800 | hierarchy: you might not need to simulate everything in order to get representative feedback
00:09:51.880 | on how your system is doing. For example, all the time there are these
00:09:57.720 | super hyper-realistic simulations coming out that look like a real
00:10:03.480 | video. And people would say that simulation system is amazing. And really that's not necessarily true,
00:10:09.240 | because what you want from a simulation system is how much you can control what parts of the system
00:10:14.280 | you're simulating and what inputs are needed. So you might just need to know this is a dog,
00:10:20.360 | this is a cat, and this is a person walking across the street. And then what should I do next
00:10:26.840 | as a result of those inputs? And so this is the same for voice. We think about it as, for example,
00:10:32.440 | workflows, tool calls, instruction following: often you actually don't even need to simulate that with
00:10:36.840 | voice. You might want to do end-to-end tests with voice, but when you're iterating,
00:10:41.320 | doing that all with text is probably the fastest and cheapest way to do that.
00:10:45.480 | Then for interruptions or latency or instructed pauses, simple voices,
00:10:53.240 | just the basic voices, are sufficient because you're doing that voice-to-voice testing, but accents
00:10:58.680 | or background noises might not impact that as much. And then where you need hyper-realistic voices of
00:11:04.440 | different accents, different background noises, different audio quality, et cetera, is when you're
00:11:09.720 | testing those things in production and trying to recreate those issues. And so thinking about what,
00:11:15.160 | what are the base level of components that you really need to develop this is super important
00:11:21.560 | for building a good eval strategy.
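Written down, that hierarchy might look something like the config sketch below; the tier names and contents simply restate the talk's examples and aren't a prescribed schema.

```python
# The realism hierarchy above as a plain config sketch; tier names are just a
# restatement of the talk, not a prescribed schema.
REALISM_TIERS = {
    "text_only": {
        "test": ["workflows", "tool calls", "instruction following"],
        "why": "fastest and cheapest loop while iterating on agent logic",
    },
    "basic_voice": {
        "test": ["interruptions", "latency", "instructed pauses"],
        "why": "needs voice-to-voice timing, but not accents or background noise",
    },
    "hyper_realistic_voice": {
        "test": ["accents", "background noise", "audio quality"],
        "why": "recreating issues seen in production",
    },
}
```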
00:11:24.040 | And then an awesome tactic that we've learned is denoising. So you run a bunch of evals and then you
00:11:34.360 | might find one that failed. And something that's really important about agents is that
00:11:38.920 | it's not the end of the world, maybe, if it fails one time. You really want to know what is the
00:11:43.000 | probability of this failing overall. So then you can find that scenario and then re-simulate that. And
00:11:49.160 | maybe you re-simulate that a hundred times. So is this scenario failing 50 out of a hundred times?
00:11:54.520 | Is it a coin flip? Is it failing 99 out of a hundred times? Which means it's definitely always failing.
00:11:59.800 | Or does it fail once out of a hundred times? And that might be totally okay for your application.
00:12:05.240 | And so, in the same way that in cloud infrastructure you ask whether you're shooting for six
00:12:09.640 | nines of reliability, for voice AI it's really important to ask what reliability you're
00:12:14.520 | looking for for different parts of your product. So now I want to talk a bit about how to build an
00:12:20.120 | eval strategy. Because we believe that evals are an important part of your process,
00:12:25.560 | a key part of your product development, and not just an engineering best practice.
00:12:30.200 | This is actually a core part of thinking through what your product does.
00:12:33.880 | So for voice AI, thinking through what metrics you should use is thinking through what your product
00:12:40.440 | does and what you want it to be good at. You can build a general voice model that's kind of good at
00:12:44.840 | everything and that already exists, right? Like you can use the OpenAI APIs, you can use all of these
00:12:51.400 | different end-to-end voice systems that already exist and are generally useful. But really you're
00:12:57.720 | probably building a vertical agent that you're trying to make useful at something in particular.
00:13:01.800 | And so thinking about what you want it to do well, and what you don't care whether it does, is a really
00:13:06.040 | important part of the process. And it's not just about latency, it's about interruptions. Like your voice
00:13:12.040 | application actually might not be so latency sensitive because someone really wants a refund. But if you're
00:13:16.760 | doing outbound sales, latency is super important because that person's about to hang up the phone.
00:13:21.080 | Interruptions, workflows. How much you adhere to instruction following, for some
00:13:29.480 | applications is really important. Like if you're booking an appointment and don't get all the details,
00:13:33.640 | it's useless. But if you're, you know, an interviewer or a therapist, you might be more
00:13:39.400 | tuned for conversational evals, conversational workflows. So really think through
00:13:46.600 | what you're trying to measure. These are the five things that we see the most. But
00:13:50.520 | LLM as a judge is a really powerful way of being really flexible and you can build out evals that are
00:13:57.240 | very specific for different customers. But something we get a lot is how do you trust LLM as a judge?
00:14:02.200 | It's this magical thing that can be so flexible to so many cases. And it
00:14:09.160 | can also be very noisy. But a common pattern that we see with LLM as a judge is that you say,
00:14:14.040 | was this conversation successful? That's going to be a really noisy metric.
00:14:18.440 | You might run that 10 times for the same conversation and it'll come back
00:14:22.840 | with lots of different responses. So in Coval, we have this metric studio that we think is
00:14:28.680 | pretty different from anything out there because it allows you to
00:14:34.120 | incorporate human feedback, calibrating your automated metrics against human judgment.
00:14:40.280 | So you can iterate over and over until your automated metrics are aligning with human feedback.
00:14:45.560 | And now you have the confidence to go deploy those in production and run them over 10,000
00:14:49.560 | conversations instead of the 100 that you labeled or the 10 that you labeled to get that confidence.
00:14:56.520 | So really, I think putting in the time to say, this is the level of reliability that we're looking
00:15:03.720 | for, and being thoughtful about that, is important. Maybe just labeling 10 conversations is enough for you,
00:15:07.960 | or maybe you really want to dial this in. But using this workflow can be really powerful.
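The calibration idea itself (separate from Coval's Metric Studio) can be sketched in a few lines: hand-label a small set of conversations, run the automated judge on the same set, and measure agreement before trusting the judge at scale. The function and the usage comment below are illustrative.

```python
# Calibration sketch: compare an automated LLM judge against a small set of
# human labels before trusting it over 10,000 unlabeled conversations.
# This is the general idea, not Coval's Metric Studio.
from typing import Callable

def agreement(transcripts: list[str],
              human_labels: list[bool],
              auto_judge: Callable[[str], bool]) -> float:
    """Fraction of conversations where the automated judge matches the human label."""
    auto = [auto_judge(t) for t in transcripts]
    return sum(a == h for a, h in zip(auto, human_labels)) / len(human_labels)

# Iterate on the judge prompt until agreement on your 10-100 labeled
# conversations is high enough, then deploy the metric at scale, e.g.:
# if agreement(sample, labels, lambda t: judge(t, "Was the inquiry resolved?")) > 0.9:
#     run_over_production_traffic()
```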
00:15:16.520 | Our other advice on how to approach evals for voice AI is starting with this system: reviewing public
00:15:22.680 | benchmarks, which can be a rough dial of, this is roughly the direction I want to go in.
00:15:28.840 | Then benchmarking with your own specific data. So, if you're a medical company,
00:15:34.440 | using medical terms that you're going to be using in production to test out different transcription methods,
00:15:40.440 | etc. Then running task-based evals, which are maybe text-only or very specific, smaller modules of your
00:15:46.600 | system. Again, what I talked about in self-driving is you don't necessarily need to enable every module
00:15:51.400 | on the car in order to test the one thing that you're trying to test. And then end-to-end evals,
00:15:55.880 | where you're running everything at scale and how it would run in production.
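As one example of benchmarking with your own data, a transcription benchmark on domain phrases (like the medical terms mentioned above) could be sketched as follows. It assumes the jiwer package for word error rate; the phrases and the transcribe() hook are placeholders.

```python
# Sketch: benchmark transcription models on your own domain phrases rather
# than only public benchmarks. Assumes the jiwer package for word error rate;
# transcribe() and the phrases are placeholders.
import jiwer

DOMAIN_PHRASES = [
    "the patient was prescribed metformin twice daily",
    "schedule a follow-up with the cardiologist",
]

def transcribe(audio_path: str, model: str) -> str:
    """Run one transcription model on one recording (wire to your providers)."""
    raise NotImplementedError

def benchmark_transcription(models: list[str], audio_paths: list[str]) -> dict[str, float]:
    results = {}
    for model in models:
        hypotheses = [transcribe(p, model) for p in audio_paths]
        results[model] = jiwer.wer(DOMAIN_PHRASES, hypotheses)  # lower is better
    return results
```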
00:16:01.160 | So we've done a lot of benchmarking. You should check out our benchmarking
00:16:04.280 | on our website. We try to do continuous benchmarking of the latest models out
00:16:11.320 | there. But doing your own custom benchmarking is also really important. So through Coval,
00:16:16.200 | you can actually do this, and you can also do it yourself. I just happen to have a tool that does this.
00:16:21.320 | But you can run on your specific data because you might prefer different voices for the types of
00:16:27.640 | conversations that you're having, or you might prefer different LLMs based on your specific tasks.
00:16:32.680 | And so benchmarking each part of your voice stack is really helpful for choosing out those models,
00:16:38.680 | especially because voice has so many models. And then building out your task evals. So starting to
00:16:45.720 | get a sense of baseline performance. Where are the problem areas in your voice application? Where are things
00:16:50.360 | working? Where could they be better? And then creating an eval process. So this means, like,
00:16:57.000 | what kinds of continuous monitoring are we doing? What do we do when we find a bug in production from
00:17:01.800 | a customer? Who takes it? And what test sets does it go into so we can make sure that it doesn't happen
00:17:06.840 | again? How do we set up our hierarchy of test sets? Do we have test sets for specific customers that we
00:17:11.480 | care a lot about? Do we have test sets for the types of customers we serve, or for specific workflows or features of our
00:17:17.000 | voice agent? And then creating dashboards and processes so that you can check in on those things
00:17:22.360 | continuously. I think an underestimated piece of the process is: what is our continuous
00:17:27.560 | eval process, versus just saying, does the voice agent work when I deploy it to production
00:17:31.960 | for this customer during their pilot period?
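That hierarchy of test sets might be organized along lines like the sketch below; all of the names are made up for illustration.

```python
# Illustrative hierarchy of test sets for a continuous eval process; the names
# are made up.
TEST_SETS = {
    "customers": {
        "key_account_a": ["book_appointment", "cancel_appointment"],
    },
    "customer_types": {
        "outbound_sales": ["latency_sensitive_openers"],
    },
    "features": {
        "appointment_booking": ["happy_path", "missing_details", "reschedule"],
    },
    "production_bugs": [],  # every bug found in production adds a scenario here
}
```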
00:17:38.520 | So, yeah, always happy to talk more about tips on what we've seen across all of the many voice
00:17:44.600 | systems that we've seen. But one of the reasons why we're so excited about the future of voice,
00:17:49.480 | and I think Quinn stole a little bit of this, but I really think that voice is the next platform.
00:17:57.400 | So we had web, we had mobile, and I think both of these were huge platform shifts in what types of
00:18:04.360 | things you expect companies will allow you to do on those platforms. What types of work,
00:18:09.160 | where in the workflow, where in your daily life are you meeting the user? And I think voice is unlocking
00:18:15.880 | all of these new really natural voice experiences. It doesn't mean you should be doing everything
00:18:20.280 | via voice, but there's really exciting potential there.
00:18:23.240 | And in the next three years, we think every enterprise is going to launch a voice experience.
00:18:30.360 | It's going to be like a mobile app, where if the airline does not have a good voice experience,
00:18:35.800 | it's going to be like not having a good mobile app, and it will just be a baseline expectation. And I think
00:18:41.000 | users' expectations of what really amazing magical voice AI experiences will be are just going to
00:18:47.080 | increase over the next few years. So we really want to enable this future, and so we think the next gen
00:18:55.640 | of scalable voice AI will be built with integrated evals using Coval.
00:18:59.640 | And we're hiring, so we're always looking for people to join us. I think this is
00:19:06.440 | really one of the most technically interesting fields that I have ever worked in because you get
00:19:11.640 | to work with every model across the stack, and there are so many different types of models,
00:19:16.360 | different types of problems, scalability, new frontiers of building infrastructure, and no one
00:19:21.480 | knows any of the answers. So it's a really exciting space. Thanks so much, everyone.