Rohit Prasad: Solving Far-Field Speech Recognition and Intent Understanding | AI Podcast Clips
And when I joined, the product was already in conception, and we started working on the far-field speech recognition. By that, we mean that you should be able to speak to the device from a distance. And in those days, that wasn't a common practice. And even in the previous research world I was in, it was considered a very hard problem in terms of whether you can converse from a distance. And here I'm still talking about the first part of the problem, the wake word, which means the word "Alexa" has to be detected with a very high accuracy, because it is a very common word. It has sound units that map with words like "I like you," so you have to detect the right mentions of "Alexa" addressed to the device, rather than the word simply occurring in other speech.
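Prasad doesn't describe the detector itself here, but the standard shape of an on-device wake word spotter is a small, always-on classifier scoring a sliding window of audio and firing only above a tuned threshold; that threshold is exactly what trades missed wakes against false wakes on phonetically close phrases like "I like you." A minimal sketch, with the classifier and its features left abstract (the names and constants below are illustrative, not Alexa's):

```python
WAKE_THRESHOLD = 0.85   # tuned on held-out positives and confusable negatives
WINDOW_SECONDS = 1.0    # roughly the duration of a spoken "Alexa"
HOP_SECONDS = 0.1       # how often the sliding window is re-scored

def detect_wake_word(audio, score_window, sample_rate=16000):
    """Scan a mono audio buffer and yield times (seconds) where the wake word fires.

    score_window: a trained classifier mapping a window of samples to
    P(wake word) in [0, 1]; its architecture is out of scope here.
    """
    window = int(WINDOW_SECONDS * sample_rate)
    hop = int(HOP_SECONDS * sample_rate)
    for start in range(0, len(audio) - window, hop):
        score = score_window(audio[start : start + window])
        if score >= WAKE_THRESHOLD:
            # Only audio following a firing is sent on for full recognition,
            # which is why precision on this single word matters so much.
            yield start / sample_rate
```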
- Not only noise, but a lot of conversation in the house. You're simply listening for the wake word, Alexa, and there's a lot of words being spoken in the house. How do you know it's Alexa and directed at Alexa? Because I could say, "I love my Alexa," "I hate my Alexa," and in all these sentences I said "Alexa" without talking to the device. Also, what would be your advice, which I should probably give to people in the introduction of this conversation, in terms of them turning off their Alexa device if they're listening to this podcast conversation out loud? Because we mentioned Alexa like a million times.
- So, we have done a lot of different things where we can figure out whether the speech near the device is coming from a live human versus played over the air. But yes, if this kind of a podcast is happening, it's possible your device will wake up a few times. It is definitely something we care very much about.
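He doesn't say what those "different things" are; one technique publicly known for suppressing wakes triggered by broadcast media, offered here as a sketch of the idea rather than a claim about Alexa's internals, is acoustic fingerprinting: if the audio around a detected wake word matches a known recording, treat it as over-the-air rather than a live human. The hashing below is a placeholder for a robust acoustic hash:

```python
import hashlib

# Hypothetical index of fingerprints for known media (ads, broadcasts)
# that contain the wake word.
KNOWN_MEDIA_FINGERPRINTS: set[str] = set()

def fingerprint(audio_bytes: bytes) -> str:
    # Placeholder: a real system would use spectral-peak hashing so the
    # match survives loudspeaker playback, room acoustics, and volume
    # changes; an exact byte hash only illustrates the control flow.
    return hashlib.sha1(audio_bytes).hexdigest()

def should_suppress_wake(audio_bytes: bytes) -> bool:
    """True if the audio around the wake word matches known broadcast media."""
    return fingerprint(audio_bytes) in KNOWN_MEDIA_FINGERPRINTS
```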
- Detecting "Alexa" versus "I like something," I mean, that's a fascinating part. And it's not like something where the phone is sitting on the table; this is like people have devices 40 feet away.
- The next part is, okay, you're speaking to the device. Of course, you're going to issue many different requests. Some may be simple, some may be extremely hard, and it's a far-field speech recognition problem, where the audio is now not coming onto your phone or a handheld mic like this or a close-talking mic, but from a distance, in a home where your daughter may be running around with something and asking your mom something, and so forth, right? All of those requests need to be recognized with very high accuracy, right? And we are still just in the recognition problem; we haven't yet come to the understanding one, right?
- Is this before neural networks began to seriously prove themselves in the audio space?
- Yeah, this is around... so I joined in 2013, in April, right? The early research on neural networks coming back into the speech recognition space had started happening. And from the very first thing we did when I joined the team, and remember it was very much a startup environment, we knew we would have to improve accuracy fast. And then there's the scale of data: once you have a device like this, you'll suddenly have large volumes of data to learn from to make the customer experience better, and to be able to train on thousands and thousands of hours of speech. So if you ask me, back in 2013 and 2014, we were good to the extent it could be useful to the customers. I wouldn't say we were great at recognizing speech in general, but we were great at it in terms of the settings that are in homes, right? And that was important even in the early stages.
- It seems like the task would be pretty daunting. So let me ask, first of all, since you mentioned startup: how likely were you to fail in the eyes of everyone else?
- I'll give you a very interesting anecdote on that. In my first meeting, and we had hired a few more people by then, nine out of ten people thought it couldn't be done. Actually, I should say eight did, and one was semi-optimistic. What people had seen work was speech recognition in constrained settings, "like either telephony speech for customer service calls." But this was the kind of belief you must have. And I had experience with far-field speech recognition, and my eyes lit up when I saw a problem like that, saying, "Okay, we have been in speech recognition all this time; here is a chance to bring something delightful into the hands of customers."
- You mentioned the way you kind of think of it at Amazon: you have a press release and an FAQ, and you think backwards. Did you have, did the team have, the Echo in mind, actually putting a thing in the home that works?
- It was very close, I would say, in terms of the vision; as I said, the vision was the Star Trek computer, right? And from there, I can't divulge all the exact specifications, but one thing that mattered to me personally was music, because my taste was still stuck where it was when I was an undergrad, and it was too hard for me to be a music fan with a phone. Right? And I hate things in my ears. So if you ask me how far we are from the original vision, it's hard to say, because every day we go in thinking, these are the new set of challenges to solve.
- Yeah, it's a great way to do great engineering; it's just a super nice way to have a focus.
- And a lot of my scientists have adopted that. Papers are all written after you've done the research or you've proven something; your PhD dissertation proposal is the closest thing that comes to a press release. But that process is now ingrained in our scientists.
- You write the paper first and then make it happen.
- In fact, it's not-
- State-of-the-art results, but you have a thesis about, here's what I expect, right?
- So, far-field recognition: what was the big leap?
- What we first did was get a lot of training data. So, how do you collect data in a far-field setup, right? We hadn't even settled what "magical" would mean here; that's always hard, since you've never done this before. How do you get there, given you have no customers right now? And I can just tell you that with the combination of the two, at that point, I got the conviction that this would work, in the settings where we felt people would use it. November 2014 is when we launched.
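The transcript doesn't detail the collection method, but a standard way to bootstrap far-field training data before you have customers is to simulate it: convolve close-talk recordings with room impulse responses and mix in household noise at controlled signal-to-noise ratios. A minimal sketch of that generic technique, not necessarily what the Alexa team did:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean, rir, noise, snr_db=15.0):
    """Turn a close-talk utterance into a simulated far-field one.

    clean: close-talk speech samples (float array, mono)
    rir:   a room impulse response from source position to a distant mic
    noise: household background noise, at least as long as `clean`
    """
    # Reverberate: the impulse response encodes the room's echoes
    # and the source-to-microphone distance.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Scale the noise to hit the target signal-to-noise ratio.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[: len(reverberant)] ** 2)
    gain = np.sqrt(speech_power / (10 ** (snr_db / 10)) / (noise_power + 1e-12))
    mixed = reverberant + gain * noise[: len(reverberant)]

    # Normalize so the result doesn't clip when written back to 16-bit audio.
    return mixed / (np.max(np.abs(mixed)) + 1e-12)
```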
- Yeah. And just having looked at voice-based interactions, like in the car, it's a source of huge frustration for people. In fact, we had this idea for collecting data on subjects to measure frustration, as a training set for computer vision, for face data, so we can get a data set of frustrated people: you have them interact with a voice-based system in the car. And we talked about how errors are also perceived differently.
- But we were not done with the problems; the next set ended up being what's known as multi-domain natural language understanding: domains like information, other kinds of household productivity, alarms, timers, even though that wasn't as big as it is now. So now you're looking at meaning understanding and acting on behalf of customers based on their requests, and even if you have gotten the words recognized, that is hard. In those days, there was still a lot of emphasis on rule-based systems, on writing grammar patterns, but we had a statistical-first approach even then: an entity recognizer and an intent classifier. In fact, we had to build the deterministic matching as a follow-up, to fix bugs that the statistical models have. It was the opposite of the usual order, in that we focused on data-driven statistical understanding.
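As a concrete picture of that statistical-first pipeline: an intent classifier and an entity (slot) recognizer make the decision, and a deterministic match runs afterwards only to patch known model mistakes, the reverse of rules-first systems. A minimal sketch; the models, labels, and override table are illustrative placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    intent: str
    slots: dict = field(default_factory=dict)

# Deterministic follow-up patterns that override known statistical-model
# bugs; note they run after the models, not instead of them.
OVERRIDES = {
    "play the stones": Interpretation("PlayMusic", {"artist": "The Rolling Stones"}),
}

def understand(utterance: str, intent_model, slot_model) -> Interpretation:
    """Statistical-first NLU: models decide; deterministic matching fixes bugs.

    intent_model: text -> intent label (e.g. "PlayMusic", "SetTimer")
    slot_model:   text -> {slot_name: value}, the entity recognizer
    Both are assumed to be trained classifiers; training is out of scope.
    """
    hypothesis = Interpretation(
        intent=intent_model(utterance),
        slots=slot_model(utterance),
    )
    return OVERRIDES.get(utterance.lower().strip(), hypothesis)
```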
- The statistical approach wins in the end if you have a huge dataset.
- And that's why it came back to how you get the data: before you have customers, how do you make sure that you have the understanding system built up? And we were talking about human-machine dialogue; even in those early days, even though it was very much transactional, do one thing, one-shot utterances, in a great way, there was a lot of debate on how much Alexa should talk back. For example, if you ask for the Stones, you don't want the match to be Stone Temple Pilots. And there we had great assets from Amazon.
I mean, even if you figured out that "the Stones" is an entity, you have to resolve whether it's the Stones or the Stone Temple Pilots or some other Stones.
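A sketch of what that resolution step can look like: score each catalog candidate by alias compatibility, global popularity, and the requesting user's own history, so "the Stones" resolves in one shot. The catalog, weights, and scoring below are illustrative assumptions, not the production system:

```python
# Illustrative catalog: artist names, spoken aliases, and a popularity prior.
CATALOG = [
    {"name": "The Rolling Stones", "aliases": {"the stones", "rolling stones"}, "popularity": 0.95},
    {"name": "Stone Temple Pilots", "aliases": {"stp", "stone temple pilots"}, "popularity": 0.80},
]

def resolve_artist(mention: str, user_play_counts: dict) -> str:
    """Pick the catalog artist a spoken mention most likely refers to."""
    words = mention.lower().split()
    best_name, best_score = None, float("-inf")
    for artist in CATALOG:
        if mention.lower().strip() in artist["aliases"]:
            alias_score = 1.0          # exact alias match
        elif any(w[:5] == n[:5] for w in words
                 for n in artist["name"].lower().split() if len(n) >= 5):
            alias_score = 0.3          # loose word-prefix overlap ("stones"/"stone")
        else:
            continue
        personal = user_play_counts.get(artist["name"], 0)
        score = alias_score + 0.5 * artist["popularity"] + 0.1 * personal
        if score > best_score:
            best_name, best_score = artist["name"], score
    return best_name

print(resolve_artist("the stones", {}))                           # The Rolling Stones
print(resolve_artist("the stones", {"Stone Temple Pilots": 12}))  # heavy STP history flips it
```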
Is that resolution the model's job, or is the job of UX communicating with the human? Ideally you want to get it right without any further questioning or UX, right? And that improves with more self-learning from these kinds of feedback signals.
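Prasad doesn't enumerate the feedback signals; one commonly used implicit signal (an assumption here, not a quote) is what the user does right after the action: barging in with "stop" or immediately rephrasing suggests a wrong resolution, while listening through suggests a right one. A sketch of turning that into weak training labels:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TurnLog:
    utterance: str                      # e.g. "play the stones"
    chosen_action: str                  # e.g. "PlayMusic(artist=Stone Temple Pilots)"
    stopped_after_s: Optional[float]    # seconds until the user said "stop", if they did
    rephrased_next: bool                # next request was a near-duplicate of this one

def label_from_feedback(turn: TurnLog) -> Optional[Tuple[str, str, int]]:
    """Convert implicit feedback into a weak label for retraining, or None.

    An immediate stop or a rephrase marks the (utterance, action) pair as a
    negative example; playing through uninterrupted marks it positive.
    """
    if turn.stopped_after_s is not None and turn.stopped_after_s < 10:
        return (turn.utterance, turn.chosen_action, 0)
    if turn.rephrased_next:
        return (turn.utterance, turn.chosen_action, 0)
    if turn.stopped_after_s is None and not turn.rephrased_next:
        return (turn.utterance, turn.chosen_action, 1)
    return None  # stopped late: ambiguous, skip
```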
So this is the problem of understanding the intent and resolving it to an action, where the action could be playing a particular artist. Again, the bar was high, as you were saying, right? So we launched with sort of 13 big domains, or we think of it as 13 big skills we had, like music, which is a massive one, when we launched.