From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

00:00:02.000 |
- Hey everyone, I'm Brooke, I'm the founder of Coval. 00:00:20.920 |
So today I'm gonna be talking a little bit about what I learned from self-driving and how it applies to voice agents. 00:00:28.360 |
Previously, I led the eval job infrastructure team at Waymo. 00:00:40.560 |
But first, why are voice agents not everywhere? 00:00:44.220 |
They have this massive promise of being able to automate 00:00:47.040 |
all of these really critical, hard workflows autonomously. 00:00:51.060 |
And I think probably a lot of you are building voice agents. 00:00:56.960 |
I think the biggest problem to launching voice agents is trust. 00:01:00.080 |
So we simultaneously are paradoxically overestimating voice agents, 00:01:06.160 |
trying to say, I'm going to automate all of my call volume 00:01:09.360 |
or all of my workflows with voice all at once, 00:01:19.200 |
while underestimating what they could possibly do in the next six months. 00:01:22.200 |
I think there's just so much more that you could be doing with voice agents today, 00:01:31.480 |
but you also know that scaling to production is really hard. 00:01:34.480 |
So it's easy to nail it for 10 conversations, 00:01:36.320 |
but to do it for 10,000 or 100,000 becomes really difficult. 00:01:40.280 |
So a lot of times voice agents get stuck in POC hell 00:01:43.400 |
where enterprises are scared to actually deploy them 00:01:46.760 |
to customer-facing use cases or non-internal workflows. 00:01:50.200 |
So I think there are two approaches to deal with that today. 00:01:56.920 |
One is to constrain the agent: you force the agent down a specific path 00:01:59.720 |
and try to get it to do something exactly as you want it to do. 00:02:07.240 |
That can work, but you're essentially forcing it into certain pathways. 00:02:12.440 |
The other is to make the agent really autonomous and flexible to new scenarios it has never seen before. 00:02:15.640 |
But this makes it really hard to scale to production, 00:02:20.280 |
and so taking actions on behalf of users, or interfacing with them directly, becomes hard to trust. 00:02:29.000 |
I think you can have reliability and autonomy. 00:02:34.040 |
Yeah, if you haven't taken a Waymo and you're from out of town, I'd really recommend trying one. 00:02:46.200 |
It's a fully autonomous system, but it's able to navigate all of these interactions with the real world. 00:02:53.480 |
And Waymo is launching in all of these new cities, 00:03:00.920 |
and I think large-scale simulation has been the huge unlock for that. 00:03:09.640 |
So starting with running the car throughout all the streets, collecting data on what it encounters, 00:03:15.640 |
and then bringing that back to the engineers. 00:03:21.480 |
Early on, a lot of this testing was scenario-based, where we're saying for this specific scenario, this exact thing should happen. 00:03:26.920 |
Your scenarios tend to no longer be useful after a very short period of time, 00:03:33.960 |
because you have to build up these very complicated scenarios, 00:03:36.600 |
and then say exactly what should happen in those. 00:03:38.360 |
So then the industry as a whole has moved to large-scale evaluation. 00:03:44.040 |
So how often is a certain type of event happening across many, many simulations? 00:03:50.040 |
And so instead of trying to say for this specific instance, 00:03:53.160 |
I want this to happen, you run large-scale simulation to really reliably show how the agent is performing. 00:03:59.080 |
So I'm going to talk through some of the things that I've learned from self-driving 00:04:04.680 |
And hopefully that's useful because it's definitely been useful for us 00:04:08.280 |
as we interact with hundreds of voice systems. 00:04:15.080 |
Self-driving and conversational evals are very similar 00:04:17.880 |
because both of them are systems where you're interacting with the real world, 00:04:21.640 |
and for each step that you take, you have to respond to the environment and go back and forth. 00:04:26.280 |
And so simulations are really important for this because 00:04:31.480 |
for every step that I take in a Waymo or a self-driving car or a smaller robotics device, 00:04:39.720 |
or in a conversation, when I say, "Hello, what's your name?" 00:04:42.920 |
you will respond differently than when I say, "Hello, what's your email?" 00:04:45.560 |
So being able to simulate all of these possible scenarios is really important 00:04:50.520 |
because otherwise you have to create these static tests or do it manually, 00:04:54.440 |
both of which are either expensive or brittle. 00:04:58.120 |
And so this doesn't make for very durable tests. 00:05:02.600 |
If you have to specifically outline every single step along the way, 00:05:05.800 |
those break immediately and are very expensive to maintain. 00:05:11.240 |
You really want to simulate all of the possible scenarios across a very large area. 00:05:16.040 |
And so the non-determinism of LLMs is actually really useful for this, because you can generate 00:05:21.400 |
all of the possible things that someone might respond with, simulate 00:05:25.240 |
that over and over, and look at the probability of my agent succeeding. 00:05:29.720 |
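A minimal sketch of that idea, assuming hypothetical `agent_reply`, `simulate_user_reply`, and `conversation_succeeded` stubs in place of real model calls and metrics: run the same scenario many times and estimate a success probability instead of asserting one expected output.

```python
import random

# Hypothetical stand-ins for model calls: in a real system these would hit
# your agent and a user-simulator model. Here they are random stubs so the
# sketch runs on its own.
def agent_reply(history: list[str]) -> str:
    return random.choice(["Sure, what's your name?", "Sorry, could you repeat that?"])

def simulate_user_reply(history: list[str]) -> str:
    return random.choice(["It's Brooke.", "Why do you need that?", "Hello?"])

def conversation_succeeded(history: list[str]) -> bool:
    # Placeholder success check; in practice this is an automated metric
    # (e.g. "did the agent collect the caller's name?").
    return any("Brooke" in turn for turn in history)

def run_simulation(scenario: str, turns: int = 4) -> bool:
    history = [scenario]
    for _ in range(turns):
        history.append(agent_reply(history))
        history.append(simulate_user_reply(history))
    return conversation_succeeded(history)

# Instead of one pass/fail test, estimate how often the agent succeeds.
N = 200
successes = sum(run_simulation("Caller wants to book an appointment") for _ in range(N))
print(f"success rate: {successes / N:.0%} over {N} simulated conversations")
```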
Another thing that I touched on a bit is input-output evals versus probabilistic evals. 00:05:38.360 |
So with LLMs to date, we have seen that you run a set of inputs for your prompt, 00:05:43.960 |
and then you look at all the outputs and evaluate 00:05:50.200 |
whether or not the output for that input was correct based on some criteria. 00:05:53.960 |
So you might have a golden data set that you're iterating against. 00:05:57.960 |
With conversational evals, it becomes even more important to have reference-free evaluation 00:06:03.000 |
where you don't necessarily need to say, "These are all of the things I'm expecting for this exact input." 00:06:10.200 |
But rather, you're defining as a whole, "How often is my agent resolving the user inquiry? 00:06:15.560 |
How often is my agent repeating itself over and over? 00:06:18.680 |
How often is my agent saying things it shouldn't?" 00:06:21.160 |
Rather than saying, "For this specific scenario, these six things should happen." 00:06:26.840 |
And so this is what's going to really allow you to scale your evals, and it's also what we did at Waymo: 00:06:31.720 |
coming up with metrics that apply to lots of scenarios. 00:06:35.320 |
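A small sketch of what reference-free metrics can look like: an illustrative `resolved_inquiry` stub stands where an LLM judge would go, repetition is a simple heuristic, and both are defined for any conversation rather than against a golden answer per input.

```python
from collections import Counter

# Reference-free checks: no golden transcript per input, just criteria that
# apply to any conversation.
def resolved_inquiry(conversation: list[str]) -> bool:
    # Stub where an LLM judge would normally decide "was the inquiry resolved?"
    return conversation[-1].lower().startswith("great, you're all set")

def repeats_itself(conversation: list[str]) -> bool:
    # Agent turns are the odd-indexed turns here; flag if any line repeats a lot.
    agent_turns = conversation[1::2]
    return any(count > 2 for count in Counter(agent_turns).values())

def score_corpus(conversations: list[list[str]]) -> dict:
    n = len(conversations)
    return {
        "resolution_rate": sum(resolved_inquiry(c) for c in conversations) / n,
        "repetition_rate": sum(repeats_itself(c) for c in conversations) / n,
    }

corpus = [
    ["I need to reschedule.", "Sure, which day works?", "Tuesday.", "Great, you're all set for Tuesday."],
    ["I need to reschedule.", "Which day?", "Tuesday.", "Which day?", "Tuesday!", "Which day?"],
]
print(score_corpus(corpus))
```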
Another thing is that constant eval loops are what made autonomous vehicles scalable, 00:06:41.000 |
and that's what's going to make voice agents scalable. 00:06:43.960 |
I think we're seeing today that voice agents are so expensive to maintain in production once you deploy to an enterprise 00:06:50.120 |
that it often becomes a professional service if you don't set up your processes right. 00:06:55.480 |
And so you're constantly making all of these tweaks for specific enterprises, which can take up 80% 00:07:00.360 |
of your time even after you've set up the initial agent. 00:07:03.400 |
So here's something that the autonomous vehicle industry has been doing. 00:07:09.000 |
Let's say you find a bug. As an engineer, I might iterate on that, 00:07:13.560 |
and I run a couple of evals to reproduce it, and then I fix the issue, and then I run more. 00:07:17.960 |
So it wasn't stopping at a stop sign; I iterate on that, and now it is stopping at a stop sign. 00:07:22.840 |
But then I run a larger regression set, because maybe I just made the car stop every 10 seconds 00:07:28.360 |
and broke everything else, so the regression set makes sure I didn't break anything. 00:07:33.800 |
And then we have a set of pre-submit and post-submit CI/CD workflows, 00:07:38.840 |
so that before you ship code, and then after you push the code to production, we make sure everything is continuously working. 00:07:45.720 |
And then there's large-scale release evals. So making sure that everything is up to par before we launch a new release. 00:07:52.280 |
And this might be both manual evals and automated evals and some combination thereof. 00:07:57.320 |
And then live monitoring and detection, which then you can feed back into this whole system. 00:08:03.560 |
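A minimal sketch of what a pre-submit gate like that could look like, assuming a hypothetical `run_simulation` helper and illustrative scenario names and thresholds: run the regression set and fail the check if any scenario's simulated pass rate drops below its threshold.

```python
import random
import sys

# Hypothetical regression set: named scenarios with the minimum pass rate we
# expect from simulation. Thresholds are per-scenario because not everything
# needs the same reliability.
REGRESSION_SET = {
    "book_appointment_happy_path": 0.95,
    "caller_interrupts_mid_sentence": 0.80,
    "caller_asks_for_refund": 0.90,
}

def run_simulation(scenario: str) -> bool:
    # Stand-in for simulating one conversation against the candidate build.
    return random.random() < 0.93

def presubmit_gate(runs_per_scenario: int = 50) -> bool:
    ok = True
    for scenario, threshold in REGRESSION_SET.items():
        passes = sum(run_simulation(scenario) for _ in range(runs_per_scenario))
        rate = passes / runs_per_scenario
        print(f"{'PASS' if rate >= threshold else 'FAIL'} {scenario}: {rate:.0%} (needs {threshold:.0%})")
        if rate < threshold:
            ok = False
    return ok

if __name__ == "__main__":
    # Non-zero exit blocks the change in CI.
    sys.exit(0 if presubmit_gate() else 1)
```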
So we're emulating a lot of this with voice. I think we think that's the right approach as well. 00:08:11.000 |
But notably, there are still manual evals involved. The goal is not to automate all evals, 00:08:18.200 |
but rather to leverage auto evals for speed and scale, and then use the manual time that you have 00:08:24.440 |
to really focus on the judgment calls that need a human touch. 00:08:30.280 |
So the process that we've seen is you might start with simulated conversations, 00:08:35.320 |
and you run some happy paths of like, I know I should be able to book an appointment. 00:08:38.680 |
So book an appointment for tomorrow. I run a bunch of simulations of that. I run evals. 00:08:44.600 |
I come up with metrics. Looking at your data is super 00:08:48.520 |
important, so I look through all those conversations and I say, these are the ways they're failing. So I set 00:08:53.240 |
up some automated metrics and iterate through this loop several times. Now I think it's ready for 00:08:58.280 |
production. So I ship it to production and then run those evals again. And so this cycle, this virtuous 00:09:04.120 |
cycle of iterating through simulation and then detecting, or flagging, things for human review, 00:09:11.080 |
and then feeding all of that back into your simulations, is super important for scalable voice agents. 00:09:22.040 |
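One way that feedback loop could look in code, with all names illustrative: conversations that trip an automated metric in production get queued for human review, and confirmed failures are promoted into the simulation scenario set.

```python
# Sketch of the feedback loop described above. All field names are illustrative.
review_queue: list[dict] = []
simulation_scenarios: list[dict] = []

def automated_metric_failed(call: dict) -> bool:
    # Stand-in for the automated metrics above (e.g. unresolved inquiry, repetition).
    return call["resolved"] is False

def flag_production_calls(calls: list[dict]) -> None:
    # Production calls that fail an automated metric go to human review.
    for call in calls:
        if automated_metric_failed(call):
            review_queue.append(call)

def promote_confirmed_failures() -> None:
    # Human-confirmed failures become new simulation scenarios.
    for call in review_queue:
        if call.get("human_verdict") == "real_failure":
            simulation_scenarios.append({
                "goal": call["caller_goal"],
                "notes": "reproduced from production call " + call["id"],
            })

calls = [
    {"id": "c-101", "caller_goal": "cancel subscription", "resolved": False, "human_verdict": "real_failure"},
    {"id": "c-102", "caller_goal": "book appointment", "resolved": True},
]
flag_production_calls(calls)
promote_confirmed_failures()
print(simulation_scenarios)
```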
A question we get a lot is: do your voice agents sound exactly like my customers? 00:09:29.480 |
And that's a good question because the level of realism is dependent on what you're trying to test. 00:09:35.320 |
So like any scientific method, right? You're trying to control variables and then test for 00:09:40.120 |
the things that you care about. Something we saw in self-driving is that there's kind of this 00:09:45.800 |
hierarchy: you might not need to simulate everything in order to get representative feedback 00:09:51.880 |
on how your system is doing. For example, all the time there are these 00:09:57.720 |
super hyper-realistic simulations coming out that look like a real 00:10:03.480 |
video. And people would say that simulation system is amazing. And really that's not necessarily true, 00:10:09.240 |
because what you want from a simulation system is how much you can control what parts of the system 00:10:14.280 |
you're simulating, and what inputs are needed. So you might just need to know: this is a dog, 00:10:20.360 |
this is a cat, and this is a person walking across the street. And then what should I do next 00:10:26.840 |
as a result of those inputs? And so this is the same for voice. We think about it the same way: for example, 00:10:32.440 |
for workflows, tool calls, and instruction following, you often don't even need to simulate with 00:10:36.840 |
voice at all. You might want to do end-to-end tests with voice, but when you're iterating, 00:10:41.320 |
doing it all with text is probably the fastest and cheapest way to do that. 00:10:45.480 |
Then for interruptions or latency or instructed pauses, simple, basic voices 00:10:53.240 |
are sufficient, because you're doing that voice-to-voice testing, but accents 00:10:58.680 |
or background noises might not impact that as much. And then where you need hyper-realistic voices of 00:11:04.440 |
different accents, different background noises, different audio quality, et cetera, is when you're 00:11:09.720 |
testing those things in production and trying to recreate those issues. And so thinking about 00:11:15.160 |
what the base level of components is that you really need for development is super important. 00:11:24.040 |
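To make the text-only level of that hierarchy concrete, here's a small sketch of checking tool calling and instruction following without synthesizing any audio; `agent_step` is a hypothetical stand-in for the agent's text interface.

```python
# Text-only check of tool calling and instruction following: no audio is
# synthesized at all. `agent_step` is a hypothetical stand-in for your
# agent's text interface; in a real test it would call your agent.
def agent_step(user_text: str) -> dict:
    # Stubbed response so the sketch runs; imagine the agent chose a tool here.
    return {"tool": "book_appointment", "args": {"date": "2025-07-02", "time": "10:00"}}

def test_booking_tool_call() -> None:
    result = agent_step("Can I get an appointment tomorrow at 10am?")
    assert result["tool"] == "book_appointment", "agent should pick the booking tool"
    assert {"date", "time"} <= result["args"].keys(), "agent should capture date and time"

test_booking_tool_call()
print("text-only tool-call check passed")
```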
And then an awesome tactic that we've learned is denoising. So you run a bunch of evals and then you 00:11:34.360 |
might find one that failed. And something that's really important about agents is that 00:11:38.920 |
it's not necessarily the end of the world if it fails one time. You really want to know the 00:11:43.000 |
overall probability of this failing. So then you can find that scenario and re-simulate it. And 00:11:49.160 |
maybe you re-simulate that a hundred times. So is this scenario failing 50 out of a hundred times? 00:11:54.520 |
Is it a coin flip? Is it failing 99 out of a hundred times? Which means it's definitely always failing. 00:11:59.800 |
Or does it fail once out of a hundred times? And that might be totally okay for your application. 00:12:05.240 |
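A minimal sketch of that denoising step, with `run_scenario_once` as a stand-in for one simulated conversation: re-run the flagged scenario many times, estimate the failure rate with a rough confidence interval, and classify it as a consistent failure, a coin flip, or a rare flake.

```python
import math
import random

def run_scenario_once(scenario: str) -> bool:
    # Stand-in for one simulated conversation of the flagged scenario.
    return random.random() < 0.55  # pretend the agent passes ~55% of the time

def denoise(scenario: str, n: int = 100) -> None:
    failures = sum(not run_scenario_once(scenario) for _ in range(n))
    rate = failures / n
    # Rough 95% interval on the failure rate (normal approximation).
    margin = 1.96 * math.sqrt(rate * (1 - rate) / n)
    print(f"{scenario}: failed {failures}/{n} (~{rate:.0%} +/- {margin:.0%})")
    if rate > 0.9:
        print("-> essentially always failing; fix before shipping")
    elif rate > 0.3:
        print("-> coin flip territory; the behavior is unstable")
    else:
        print("-> rare flake; may be acceptable depending on your reliability target")

denoise("caller gives the date in an unusual format")
```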
And so, in the same way that in cloud infrastructure you ask whether you're shooting for six 00:12:09.640 |
nines of reliability, for voice AI it's really important to ask what reliability you're 00:12:14.520 |
looking for for different parts of your product. So now I want to talk a bit about how to build an 00:12:20.120 |
eval strategy. Because we believe that evals are an important part of your process; 00:12:25.560 |
they're a key part of your product development, not just an engineering best practice. 00:12:30.200 |
This is actually like a core part of thinking through what does your product do. 00:12:33.880 |
So for voice AI, thinking through what metrics you should use is thinking through what your product 00:12:40.440 |
does and what you want it to be good at. You can build a general voice model that's kind of good at 00:12:44.840 |
everything, and that already exists, right? You can use the OpenAI APIs, you can use all of these 00:12:51.400 |
different end-to-end voice systems that already exist and are generally useful. But really you're 00:12:57.720 |
probably building a vertical agent that you're trying to make useful at something in particular. 00:13:01.800 |
And so thinking about what you want it to do well, and what you don't care about, is a really 00:13:06.040 |
important part of the process. And it's not just about latency; it's also about interruptions. Your voice 00:13:12.040 |
application actually might not be so latency sensitive because someone really wants a refund. But if you're 00:13:16.760 |
doing outbound sales, latency is super important because that person's about to hang up the phone. 00:13:21.080 |
Interruptions, workflows: how much you adhere to instruction following for some 00:13:29.480 |
applications is really important. If you're booking an appointment and don't get all the details, 00:13:33.640 |
it's useless. But if you're an interviewer or a therapist, you might be more 00:13:39.400 |
tuned for conversational workflows. So really think through 00:13:46.600 |
what you're trying to measure. These are the five things that we see the most. But 00:13:50.520 |
LLM as a judge is a really powerful way of being really flexible and you can build out evals that are 00:13:57.240 |
very specific for different customers. But something we get a lot is how do you trust LLM as a judge? 00:14:02.200 |
It's this magical thing that can be so flexible to so many cases, and it also 00:14:09.160 |
can be very noisy. But a common pattern that we see with LLM as a judge is that you say, 00:14:14.040 |
was this conversation successful? That's going to be a really noisy metric. 00:14:18.440 |
You might run that 10 times for the same conversation and it will come back 00:14:22.840 |
with lots of different responses. So in Coval, we have this metric studio that we think is 00:14:28.680 |
pretty different from anything out there, because it allows you to 00:14:34.120 |
incorporate human feedback, calibrating your automated metrics against that feedback. 00:14:40.280 |
So you can iterate over and over until your automated metrics are aligning with human feedback. 00:14:45.560 |
And now you have the confidence to go deploy those in production and run them over 10,000 00:14:49.560 |
conversations instead of the 100 that you labeled or the 10 that you labeled to get that confidence. 00:14:56.520 |
So really, I think it's worth putting in the time to say, this is the level of reliability that we're looking 00:15:03.720 |
for, and being thoughtful about that. Maybe just labeling 10 conversations is enough for you, 00:15:07.960 |
or maybe you really want to dial this in. But using this workflow can be really powerful. 00:15:16.520 |
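A small sketch of that calibration loop, with `llm_judge` as a stub where the model call would go: score a handful of human-labeled conversations, measure agreement, and only scale the metric to the unlabeled production traffic once agreement is acceptable.

```python
# Calibrating an automated judge against a small human-labeled set before
# trusting it at scale. `llm_judge` is a stub where a model call would go.
def llm_judge(conversation: str) -> bool:
    # e.g. prompt: "Did the agent resolve the caller's request? Answer yes/no."
    return "all set" in conversation.lower()

labeled = [
    ("... Great, you're all set for Tuesday.", True),
    ("... I'm sorry, I can't help with that.", False),
    ("... You're all set. Anything else?", True),
    ("... Please hold. Please hold. Please hold.", False),
]

agreements = [llm_judge(text) == human_label for text, human_label in labeled]
agreement_rate = sum(agreements) / len(agreements)
print(f"judge agrees with humans on {agreement_rate:.0%} of labeled conversations")

# Only once agreement is high enough do you run the judge over the
# thousands of unlabeled production conversations.
if agreement_rate >= 0.9:
    print("calibrated: safe to scale this metric up")
else:
    print("keep iterating on the judge prompt and criteria")
```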
Our other advice for how to approach evals for voice AI is to start by reviewing public 00:15:22.680 |
benchmarks, which can be a rough dial of roughly the direction you want to go in. 00:15:28.840 |
Then benchmark with your own specific data. So if you're a medical company, 00:15:34.440 |
use the medical terms that you're going to be using in production to test out different transcription methods, 00:15:40.440 |
etc. Then run task-based evals, which are maybe text-only or cover very specific smaller modules of your 00:15:46.600 |
system. Again, what I talked about in self-driving is you don't necessarily need to enable every module 00:15:51.400 |
on the car in order to test the one thing that you're trying to test. And then end-to-end evals, 00:15:55.880 |
where you're running everything at scale and how it would run in production. 00:16:01.160 |
So we've done a lot of benchmarking. You should check out our benchmarking 00:16:04.280 |
on our website. We try to do continuous benchmarking of the latest models out 00:16:11.320 |
there. But doing your own custom benchmarking is also really important. So through Coval 00:16:16.200 |
you can do this, and you can also do it yourself; I just happen to have a tool that does this. 00:16:21.320 |
But you can run on your specific data because you might prefer different voices for the types of 00:16:27.640 |
conversations that you're having, or you might prefer different LLMs based on your specific tasks. 00:16:32.680 |
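As an illustration of custom benchmarking on your own data, here's a sketch that compares two hypothetical transcription models on made-up domain phrases using word error rate; `transcribe` is a placeholder for calling each vendor on your recorded audio.

```python
# Benchmarking transcription models on your own domain terms (here, a made-up
# medical phrase). Word error rate is computed with a small edit-distance helper.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def transcribe(model: str, audio_path: str) -> str:
    # Placeholder: in practice this calls each vendor's model on your recorded audio.
    return {"model_a": "patient prescribed met forming 500 mg",
            "model_b": "patient prescribed metformin 500 mg"}[model]

references = {"clip_001.wav": "patient prescribed metformin 500 mg"}
for model in ("model_a", "model_b"):
    wers = [word_error_rate(ref, transcribe(model, path)) for path, ref in references.items()]
    print(model, "mean WER on domain terms:", round(sum(wers) / len(wers), 2))
```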
And so benchmarking each part of your voice stack is really helpful for choosing those models, 00:16:38.680 |
especially because voice has so many models. And then build out your task evals, starting to 00:16:45.720 |
get a sense of baseline performance. Where are the problem areas in your voice application? Where are things 00:16:50.360 |
working? Where could they be better? And then creating an eval process. So this means, like, 00:16:57.000 |
what kinds of continuous monitoring are we doing? What do we do when we find a bug in production from 00:17:01.800 |
a customer? Who takes it? And what test sets does it go into so we can make sure that it doesn't happen 00:17:06.840 |
again? How do we set up our hierarchy of test sets? Do we have test sets for specific customers that we 00:17:11.480 |
care a lot about? For the types of customers that we have? For specific workflows or features of our 00:17:17.000 |
voice agent? And then create dashboards and processes so that you can check in on those things 00:17:22.360 |
continuously. I think an underestimated piece of the process is asking: what is our continuous 00:17:27.560 |
eval process, versus just saying, does the voice agent work when I deploy it to production for 00:17:31.960 |
this customer during their pilot period? 00:17:38.520 |
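One way to make that process concrete is to encode the test-set hierarchy and its cadence as data, so that what runs per commit, nightly, and pre-release is written down rather than tribal knowledge; every name and cadence here is illustrative.

```python
# Illustrative eval plan: which test sets run on which trigger, and how many
# simulations per scenario. None of these names are prescriptive.
EVAL_PLAN = {
    "per_commit": {
        "test_sets": ["smoke_happy_paths"],
        "runs_per_scenario": 10,
    },
    "nightly": {
        "test_sets": ["core_workflows", "customer_acme_critical", "regressions_from_prod"],
        "runs_per_scenario": 50,
    },
    "pre_release": {
        "test_sets": ["all"],
        "runs_per_scenario": 200,
    },
}

def schedule(trigger: str) -> None:
    plan = EVAL_PLAN[trigger]
    for test_set in plan["test_sets"]:
        print(f"[{trigger}] run {test_set} x{plan['runs_per_scenario']} and post results to the dashboard")

schedule("nightly")
```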
So, yeah, always happy to talk more about tips on what we've seen across all of the many voice 00:17:44.600 |
systems that we've worked with. But one of the reasons why we're so excited about the future of voice: 00:17:49.480 |
I think Quinn stole a little bit of this, but I really think that voice is the next platform. 00:17:57.400 |
So we had web, we had mobile, and I think both of these were huge platform shifts in what types of 00:18:04.360 |
things you expect companies to let you do on those platforms, what types of work, 00:18:09.160 |
and where in the workflow, where in your daily life you're meeting the user. And I think voice is unlocking 00:18:15.880 |
all of these new, really natural voice experiences. It doesn't mean everything should be done 00:18:20.280 |
via voice, but there's really exciting potential there. 00:18:23.240 |
And in the next three years, we think every enterprise is going to launch a voice experience. 00:18:30.360 |
It's going to be like a mobile app, where if the airline does not have a good voice experience, 00:18:35.800 |
it's going to be like not having a good mobile app, and it will just be a baseline expectation. And I think 00:18:41.000 |
users' expectations of what a really amazing, magical voice AI experience looks like are just going to 00:18:47.080 |
increase over the next few years. So we really want to enable this future, and we think the next gen 00:18:55.640 |
of scalable voice AI will be built with integrated evals, using Coval. 00:18:59.640 |
And we're hiring, so we're always looking for people to join us. I think this is 00:19:06.440 |
really one of the most technically interesting fields that I have ever worked in because you get 00:19:11.640 |
to work with every model across the stack, and there are so many different types of models, 00:19:16.360 |
different types of problems, scalability, new frontiers of building infrastructure, and no one 00:19:21.480 |
knows any of the answers. So it's a really exciting space. Thanks so much, everyone.