Rohit Prasad: Solving Far-Field Speech Recognition and Intent Understanding | AI Podcast Clips
And when I joined, the product was already in conception, and we started working on the far-field speech recognition. By that, we mean that you should be able to speak to the device from a distance. And in those days, that wasn't a common practice. And even in the previous research world I was in, it was considered a very hard problem in terms of whether you can converse from a distance. And here I'm still talking about the first part of the problem, the wake word, which means the word "Alexa" has to be detected with a very high accuracy, because it is a very common word. It has sound units that map with words like "I like you," so you have to detect the right mentions of "Alexa" addressed to the device, rather than the word simply occurring in other speech.
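Prasad doesn't describe the detector itself here, but the standard shape of an on-device wake word spotter is a small, always-on classifier scoring a sliding window of audio and firing only above a tuned threshold; that threshold is exactly what trades missed wakes against false wakes on phonetically close phrases like "I like you." A minimal sketch, with the classifier and its features left abstract (the names and constants below are illustrative, not Alexa's):

```python
WAKE_THRESHOLD = 0.85   # tuned on held-out positives and confusable negatives
WINDOW_SECONDS = 1.0    # roughly the duration of a spoken "Alexa"
HOP_SECONDS = 0.1       # how often the sliding window is re-scored

def detect_wake_word(audio, score_window, sample_rate=16000):
    """Scan a mono audio buffer and yield times (seconds) where the wake word fires.

    score_window: a trained classifier mapping a window of samples to
    P(wake word) in [0, 1]; its architecture is out of scope here.
    """
    window = int(WINDOW_SECONDS * sample_rate)
    hop = int(HOP_SECONDS * sample_rate)
    for start in range(0, len(audio) - window, hop):
        score = score_window(audio[start : start + window])
        if score >= WAKE_THRESHOLD:
            # Only audio following a firing is sent on for full recognition,
            # which is why precision on this single word matters so much.
            yield start / sample_rate
```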
- Not only noise, but a lot of conversation in the house. You're simply listening for the wake word, Alexa, and there's a lot of words being spoken in the house. How do you know it's Alexa and directed at Alexa? Because I could say, "I love my Alexa," "I hate my Alexa," and in all these sentences I said "Alexa" without talking to the device. Also, what would be your advice, which I should probably give to people in the introduction of this conversation, in terms of them turning off their Alexa device if they're listening to this podcast conversation out loud? Because we mentioned Alexa like a million times.
- So, we have done a lot of different things where we can figure out whether the speech near the device is coming from a live human versus played over the air. But yes, if this kind of a podcast is happening, it's possible your device will wake up a few times. It is definitely something we care very much about.
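He doesn't say what those "different things" are; one technique publicly known for suppressing wakes triggered by broadcast media, offered here as a sketch of the idea rather than a claim about Alexa's internals, is acoustic fingerprinting: if the audio around a detected wake word matches a known recording, treat it as over-the-air rather than a live human. The hashing below is a placeholder for a robust acoustic hash:

```python
import hashlib

# Hypothetical index of fingerprints for known media (ads, broadcasts)
# that contain the wake word.
KNOWN_MEDIA_FINGERPRINTS: set[str] = set()

def fingerprint(audio_bytes: bytes) -> str:
    # Placeholder: a real system would use spectral-peak hashing so the
    # match survives loudspeaker playback, room acoustics, and volume
    # changes; an exact byte hash only illustrates the control flow.
    return hashlib.sha1(audio_bytes).hexdigest()

def should_suppress_wake(audio_bytes: bytes) -> bool:
    """True if the audio around the wake word matches known broadcast media."""
    return fingerprint(audio_bytes) in KNOWN_MEDIA_FINGERPRINTS
```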
- Detecting "Alexa" versus "I like something," I mean, that's a fascinating part. And it's not like something where the phone is sitting on the table; this is like people have devices 40 feet away.
- The next part is, okay, you're speaking to the device. Of course, you're going to issue many different requests. Some may be simple, some may be extremely hard, and it's a far-field speech recognition problem, where the audio is now not coming onto your phone or a handheld mic like this or a close-talking mic, but from a distance, in a home where your daughter may be running around with something and asking your mom something, and so forth, right? All of those requests need to be recognized with very high accuracy, right? And we are still just in the recognition problem; we haven't yet come to the understanding one, right?
- Is this before neural networks began to seriously prove themselves in the audio space?
- Yeah, this is around... so I joined in 2013, in April, right? The early research on neural networks coming back into the speech recognition space had started happening. And from the very first thing we did when I joined the team, and remember it was very much a startup environment, we knew we would have to improve accuracy fast. And then there's the scale of data: once you have a device like this, you'll suddenly have large volumes of data to learn from to make the customer experience better, and to be able to train on thousands and thousands of hours of speech. So if you ask me, back in 2013 and 2014, we were good to the extent it could be useful to the customers. I wouldn't say we were great at recognizing speech in general, but we were great at it in terms of the settings that are in homes, right? And that was important even in the early stages.
- It seems like the task would be pretty daunting. So let me ask, first of all, since you mentioned startup: how likely were you to fail in the eyes of everyone else?
- I'll give you a very interesting anecdote on that. In my first meeting, and we had hired a few more people by then, nine out of ten people thought it couldn't be done. Actually, I should say eight did, and one was semi-optimistic. What people had seen work was speech recognition in constrained settings, "like either telephony speech for customer service calls." But this was the kind of belief you must have. And I had experience with far-field speech recognition, and my eyes lit up when I saw a problem like that, saying, "Okay, we have been in speech recognition all this time; here is a chance to bring something delightful into the hands of customers."
- You mentioned the way you kind of think of it at Amazon: you have a press release and an FAQ, and you think backwards. Did you have, did the team have, the Echo in mind, actually putting a thing in the home that works?
- It was very close, I would say, in terms of the vision; as I said, the vision was the Star Trek computer, right? And from there, I can't divulge all the exact specifications, but one thing that mattered to me personally was music, because my taste was still stuck where it was when I was an undergrad, and it was too hard for me to be a music fan with a phone. Right? And I hate things in my ears. So if you ask me how far we are from the original vision, it's hard to say, because every day we go in thinking, these are the new set of challenges to solve.
- Yeah, it's a great way to do great engineering; it's just a super nice way to have a focus.
- And a lot of my scientists have adopted that. Papers are all written after you've done the research or you've proven something; your PhD dissertation proposal is the closest thing that comes to a press release. But that process is now ingrained in our scientists.
- You write the paper first and then make it happen.
- In fact, it's not-
- State-of-the-art results, but you have a thesis about, here's what I expect, right?
- So, far-field recognition: what was the big leap?
- What we first did was get a lot of training data. So, how do you collect data in a far-field setup, right? We hadn't even settled what "magical" would mean here; that's always hard, since you've never done this before. How do you get there, given you have no customers right now? And I can just tell you that with the combination of the two, at that point, I got the conviction that this would work, in the settings where we felt people would use it. November 2014 is when we launched.
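The transcript doesn't detail the collection method, but a standard way to bootstrap far-field training data before you have customers is to simulate it: convolve close-talk recordings with room impulse responses and mix in household noise at controlled signal-to-noise ratios. A minimal sketch of that generic technique, not necessarily what the Alexa team did:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean, rir, noise, snr_db=15.0):
    """Turn a close-talk utterance into a simulated far-field one.

    clean: close-talk speech samples (float array, mono)
    rir:   a room impulse response from source position to a distant mic
    noise: household background noise, at least as long as `clean`
    """
    # Reverberate: the impulse response encodes the room's echoes
    # and the source-to-microphone distance.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Scale the noise to hit the target signal-to-noise ratio.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[: len(reverberant)] ** 2)
    gain = np.sqrt(speech_power / (10 ** (snr_db / 10)) / (noise_power + 1e-12))
    mixed = reverberant + gain * noise[: len(reverberant)]

    # Normalize so the result doesn't clip when written back to 16-bit audio.
    return mixed / (np.max(np.abs(mixed)) + 1e-12)
```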
- Yeah. And just having looked at voice-based interactions, like in the car, it's a source of huge frustration for people. In fact, we had this idea for collecting data on subjects to measure frustration, as a training set for computer vision, for face data, so we can get a data set of frustrated people: you have them interact with a voice-based system in the car. And we talked about how errors are also perceived differently.
- But we were not done with the problems; the next set ended up being what's known as multi-domain natural language understanding: domains like information, other kinds of household productivity, alarms, timers, even though that wasn't as big as it is now. So now you're looking at meaning understanding and acting on behalf of customers based on their requests, and even if you have gotten the words recognized, that is hard. In those days, there was still a lot of emphasis on rule-based systems, on writing grammar patterns, but we had a statistical-first approach even then: an entity recognizer and an intent classifier. In fact, we had to build the deterministic matching as a follow-up, to fix bugs that the statistical models have. It was the opposite of the usual order, in that we focused on data-driven statistical understanding.
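As a concrete picture of that statistical-first pipeline: an intent classifier and an entity (slot) recognizer make the decision, and a deterministic match runs afterwards only to patch known model mistakes, the reverse of rules-first systems. A minimal sketch; the models, labels, and override table are illustrative placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    intent: str
    slots: dict = field(default_factory=dict)

# Deterministic follow-up patterns that override known statistical-model
# bugs; note they run after the models, not instead of them.
OVERRIDES = {
    "play the stones": Interpretation("PlayMusic", {"artist": "The Rolling Stones"}),
}

def understand(utterance: str, intent_model, slot_model) -> Interpretation:
    """Statistical-first NLU: models decide; deterministic matching fixes bugs.

    intent_model: text -> intent label (e.g. "PlayMusic", "SetTimer")
    slot_model:   text -> {slot_name: value}, the entity recognizer
    Both are assumed to be trained classifiers; training is out of scope.
    """
    hypothesis = Interpretation(
        intent=intent_model(utterance),
        slots=slot_model(utterance),
    )
    return OVERRIDES.get(utterance.lower().strip(), hypothesis)
```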
- The statistical approach wins in the end if you have a huge dataset.
- And that's why it came back to how you get the data: before you have customers, how do you make sure that you have the understanding system built up? And we were talking about human-machine dialogue; even in those early days, even though it was very much transactional, do one thing, one-shot utterances, in a great way, there was a lot of debate on how much Alexa should talk back. For example, if you ask for the Stones, you don't want the match to be Stone Temple Pilots. And there we had great assets from Amazon.
I mean, even if you figured out that "the Stones" is an entity, you have to resolve whether it's the Stones or the Stone Temple Pilots or some other Stones.
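A sketch of what that resolution step can look like: score each catalog candidate by alias compatibility, global popularity, and the requesting user's own history, so "the Stones" resolves in one shot. The catalog, weights, and scoring below are illustrative assumptions, not the production system:

```python
# Illustrative catalog: artist names, spoken aliases, and a popularity prior.
CATALOG = [
    {"name": "The Rolling Stones", "aliases": {"the stones", "rolling stones"}, "popularity": 0.95},
    {"name": "Stone Temple Pilots", "aliases": {"stp", "stone temple pilots"}, "popularity": 0.80},
]

def resolve_artist(mention: str, user_play_counts: dict) -> str:
    """Pick the catalog artist a spoken mention most likely refers to."""
    words = mention.lower().split()
    best_name, best_score = None, float("-inf")
    for artist in CATALOG:
        if mention.lower().strip() in artist["aliases"]:
            alias_score = 1.0          # exact alias match
        elif any(w[:5] == n[:5] for w in words
                 for n in artist["name"].lower().split() if len(n) >= 5):
            alias_score = 0.3          # loose word-prefix overlap ("stones"/"stone")
        else:
            continue
        personal = user_play_counts.get(artist["name"], 0)
        score = alias_score + 0.5 * artist["popularity"] + 0.1 * personal
        if score > best_score:
            best_name, best_score = artist["name"], score
    return best_name

print(resolve_artist("the stones", {}))                           # The Rolling Stones
print(resolve_artist("the stones", {"Stone Temple Pilots": 12}))  # heavy STP history flips it
```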
Is that resolution the model's job, or is the job of UX communicating with the human? Ideally you want to get it right without any further questioning or UX, right? And that improves with more self-learning from these kinds of feedback signals.
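Prasad doesn't enumerate the feedback signals; one commonly used implicit signal (an assumption here, not a quote) is what the user does right after the action: barging in with "stop" or immediately rephrasing suggests a wrong resolution, while listening through suggests a right one. A sketch of turning that into weak training labels:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TurnLog:
    utterance: str                      # e.g. "play the stones"
    chosen_action: str                  # e.g. "PlayMusic(artist=Stone Temple Pilots)"
    stopped_after_s: Optional[float]    # seconds until the user said "stop", if they did
    rephrased_next: bool                # next request was a near-duplicate of this one

def label_from_feedback(turn: TurnLog) -> Optional[Tuple[str, str, int]]:
    """Convert implicit feedback into a weak label for retraining, or None.

    An immediate stop or a rephrase marks the (utterance, action) pair as a
    negative example; playing through uninterrupted marks it positive.
    """
    if turn.stopped_after_s is not None and turn.stopped_after_s < 10:
        return (turn.utterance, turn.chosen_action, 0)
    if turn.rephrased_next:
        return (turn.utterance, turn.chosen_action, 0)
    if turn.stopped_after_s is None and not turn.rephrased_next:
        return (turn.utterance, turn.chosen_action, 1)
    return None  # stopped late: ambiguous, skip
```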
So this is the problem of understanding the intent and resolving it to an action, where the action could be playing a particular artist. Again, the bar was high, as you were saying, right? So we launched with sort of 13 big domains, or we think of it as 13 big skills we had, like music, which is a massive one, when we launched.