Rohit Prasad: Solving Far-Field Speech Recognition and Intent Understanding | AI Podcast Clips

00:00:00.000 | The inspiration was the Star Trek computer.
00:00:03.920 | So when you think of it that way,
00:00:05.360 | you know, everything is possible,
00:00:07.060 | but when you launch a product,
00:00:08.160 | you have to start with some place.
00:00:10.840 | And when I joined, the product was already in conception
00:00:15.320 | and we started working on the far field speech recognition
00:00:18.760 | because that was the first thing to solve.
00:00:20.760 | By that, we mean that you should be able to speak
00:00:22.680 | to the device from a distance.
00:00:25.080 | And in those days, that wasn't a common practice.
00:00:28.640 | And even in the previous research world I was in,
00:00:32.200 | it was considered an unsolvable problem then
00:00:34.480 | in terms of whether you can converse from a length.
00:00:38.200 | And here I'm still talking about the first part
00:00:40.220 | of the problem where you say,
00:00:42.300 | get the attention of the device,
00:00:43.960 | as in by saying what we call the wake word,
00:00:47.000 | which means the word Alexa has to be detected
00:00:50.240 | with a very high accuracy because it is a very common word.
00:00:54.760 | It has sound units that map with words like I like you
00:00:58.120 | or Alec, Alex, right?
00:01:01.000 | So it's an undoubtedly hard problem
00:01:04.680 | to detect the right mentions of Alexa addressed
00:01:08.840 | to the device versus I like Alexa.
00:01:12.680 | - So you have to pick up that signal
00:01:14.120 | when there's a lot of noise.
00:01:15.920 | - Not only noise, but a lot of conversation in the house.
00:01:19.280 | Remember on the device,
00:01:20.160 | you're simply listening for the wake word, Alexa.
00:01:23.040 | And there's a lot of words being spoken in the house.
00:01:25.640 | How do you know it's Alexa and directed at Alexa?
00:01:30.640 | Because I could say, I love my Alexa, I hate my Alexa,
00:01:35.200 | I want Alexa to do this.
00:01:36.880 | And in all these three sentences I said Alexa,
00:01:39.160 | I didn't want it to wake up.
00:01:40.600 | - Can I just pause on that a second?
00:01:43.600 | What would be your advice that I should probably
00:01:46.560 | in the introduction of this conversation give to people
00:01:49.800 | in terms of them turning off their Alexa device
00:01:53.340 | if they're listening to this podcast conversation out loud?
00:01:58.340 | Like what's the probability
00:02:00.440 | that an Alexa device will go off?
00:02:02.160 | Because we mentioned Alexa like a million times.
00:02:05.080 | - So it will, we have done a lot of different things
00:02:08.000 | where we can figure out that there is the device,
00:02:13.000 | the speech is coming from a human versus over the air.
00:02:18.120 | Also, I mean, in terms of like, also it is,
00:02:20.480 | think about ads or,
00:02:22.720 | so we also launched a technology
00:02:24.120 | for watermarking kind of approaches
00:02:26.160 | in terms of filtering it out.
00:02:28.680 | But yes, if this kind of a podcast is happening,
00:02:31.500 | it's possible your device will wake up a few times.
00:02:34.240 | It's an unsolved problem,
00:02:35.320 | but it is definitely something we care very much about.
00:02:40.320 | - But the idea is you want to detect Alexa.
00:02:43.800 | - Meant for the device.
00:02:45.960 | - First of all, just even hearing Alexa
00:02:47.400 | versus I like something, I mean, that's a fascinating part.
00:02:50.960 | So that was the first relief.
00:02:52.920 | - That's the first part.
00:02:53.760 | - The world's best detector of Alexa.
00:02:55.840 | - Yeah, the world's best wake word detector
00:02:58.600 | in a far field setting,
00:02:59.800 | not like something where the phone is sitting on the table.
00:03:02.820 | This is like people have devices 40 feet away,
00:03:06.560 | like in my house or 20 feet away,
00:03:08.240 | and you still get an answer.
00:03:10.520 | So that was the first part.
00:03:12.360 | The next is, okay, you're speaking to the device.
00:03:15.720 | Of course, you're gonna issue many different requests.
00:03:18.880 | Some may be simple, some may be extremely hard,
00:03:21.400 | but it's a large vocabulary,
00:03:22.480 | speech recognition problem essentially,
00:03:24.480 | where the audio is now not coming onto your phone
00:03:28.640 | or a handheld mic like this or a close talking mic,
00:03:31.880 | but it's from 20 feet away
00:03:33.680 | where if you're in a busy household,
00:03:36.080 | your son may be listening to music,
00:03:38.640 | your daughter may be running around with something
00:03:41.400 | and asking your mom something and so forth, right?
00:03:43.640 | So this is like a common household setting
00:03:46.200 | where the words you're speaking to Alexa
00:03:50.000 | need to be recognized with very high accuracy, right?
00:03:53.200 | Now we are still just in the recognition problem.
00:03:55.640 | We haven't yet come to the understanding one, right?
00:03:57.960 | - And if I pause, I'm sorry, once again,
00:03:59.960 | what year was this?
00:04:00.800 | Is this before neural networks began to start
00:04:05.280 | to seriously prove themselves in the audio space?
00:04:10.320 | - Yeah, this is around, so I joined in 2013, in April, right?
00:04:15.400 | So the early research and neural networks coming back
00:04:18.720 | and showing some promising results
00:04:21.160 | in speech recognition space had started happening,
00:04:23.480 | but it was very early.
00:04:25.280 | But we just to now build on that,
00:04:27.720 | on the very first thing we did when I joined the team,
00:04:32.720 | and remember it was a very much of a startup environment,
00:04:35.840 | which is great about Amazon.
00:04:38.040 | And we doubled down on deep learning right away.
00:04:41.160 | And we knew we'll have to improve accuracy fast.
00:04:46.160 | And because of that, we worked on,
00:04:48.880 | and the scale of data, once you have a device like this,
00:04:51.600 | if it is successful, will improve big time.
00:04:54.880 | Like you'll suddenly have large volumes of data
00:04:58.000 | to learn from to make the customer experience better.
00:05:01.000 | So how do you scale deep learning?
00:05:02.440 | So we did one of the first works
00:05:04.520 | in training with distributed GPUs
00:05:07.560 | and where the training time was linear
00:05:11.360 | in terms of like in the amount of data.
00:05:13.920 | So that was quite important work
00:05:16.160 | where it was algorithmic improvements
00:05:17.800 | as well as a lot of engineering improvements
00:05:19.880 | to be able to train on thousands and thousands of hours of speech.
00:05:23.960 | And that was an important factor.
00:05:25.560 | So if you ask me like in back in 2013 and 2014,
00:05:29.280 | when we launched Echo,
00:05:32.400 | the combination of large scale data,
00:05:35.640 | deep learning progress, near infinite GPUs
00:05:39.640 | we had available on AWS even then,
00:05:43.040 | all came together for us to be able
00:05:45.240 | to solve the far field speech recognition
00:05:48.360 | to the extent it could be useful to the customers.
00:05:50.560 | It's still not solved.
00:05:51.400 | Like, I mean, it's not that we are perfect
00:05:52.960 | at recognizing speech, but we are great at it
00:05:55.440 | in terms of the settings that are in homes, right?
00:05:58.320 | So, and that was important even in the early stages.
00:06:00.880 | - So first of all, just even,
00:06:01.880 | I'm trying to look back at that time.
00:06:05.000 | If I remember correctly,
00:06:06.960 | it seems like the task would be pretty daunting.
00:06:11.080 | So like, so we kind of take it for granted
00:06:14.280 | that it works now.
00:06:16.200 | - Yes, you're right.
00:06:17.520 | - So let me like how, first of all, you mentioned startup.
00:06:20.680 | I wasn't familiar how big the team was.
00:06:22.680 | I kind of, 'cause I know there's a lot
00:06:24.000 | of really smart people working on it.
00:06:25.840 | So now it's a very, very large team.
00:06:27.680 | How big was the team?
00:06:30.640 | How likely were you to fail in the eyes of everyone else?
00:06:34.000 | (laughing)
00:06:35.320 | - And ourselves?
00:06:36.160 | (laughing)
00:06:37.000 | - And yourself?
00:06:37.840 | So like what?
00:06:38.680 | - I'll give you a very interesting anecdote on that.
00:06:41.520 | When I joined the team,
00:06:43.800 | the speech recognition team was six people.
00:06:47.600 | My first meeting, and we had hired a few more people,
00:06:50.440 | it was 10 people.
00:06:51.640 | Nine out of 10 people thought it can't be done.
00:06:55.480 | Right?
00:06:58.320 | - Who was the one?
00:06:59.160 | (laughing)
00:07:00.000 | - The one was me, say.
00:07:01.600 | Actually I should say, and one was semi-optimistic.
00:07:05.000 | - Yeah.
00:07:05.840 | - And eight were trying to convince,
00:07:08.880 | "Let's go to the management and say,
00:07:11.480 | "let's not work on this problem.
00:07:13.360 | "Let's work on some other problem,
00:07:15.000 | "like either telephony speech for customer service calls,"
00:07:18.840 | and so forth.
00:07:20.000 | But this was the kind of belief you must have.
00:07:21.880 | And I had experience with far field speech recognition,
00:07:24.200 | and my eyes lit up when I saw a problem like that,
00:07:27.160 | saying, "Okay, we have been in speech recognition
00:07:30.600 | "always looking for that killer app."
00:07:32.800 | - Yeah.
00:07:33.640 | - And this was a killer use case
00:07:35.680 | to bring something delightful in the hands of customers.
00:07:38.680 | - You mentioned the way you kind of think of it
00:07:41.000 | in a product way in the future,
00:07:42.480 | have a press release and an FAQ, and you think backwards.
00:07:45.120 | - That's right.
00:07:45.960 | - Did you have, did the team have the Echo in mind?
00:07:49.720 | So this far field speech recognition,
00:07:52.880 | actually putting a thing in the home that works,
00:07:55.200 | that it's able to interact with,
00:07:56.480 | was that the press release?
00:07:58.000 | What was the--
00:07:58.840 | - It was very close, I would say, in terms of the,
00:08:01.320 | as I said, the vision was Star Trek computer, right?
00:08:04.600 | So, or the inspiration.
00:08:06.760 | And from there, I can't divulge all the exact specifications,
00:08:10.480 | but one of the first things that
00:08:13.560 | was magical on Alexa was music.
00:08:18.680 | It brought me back to music
00:08:21.040 | because my taste was still in when I was an undergrad.
00:08:24.040 | So I still listened to those songs,
00:08:25.440 | and it was too hard for me to be a music fan with a phone.
00:08:30.440 | Right, so I, and I don't, I hate things in my ear.
00:08:34.000 | So from that perspective, it was quite hard.
00:08:37.960 | And music was part of the,
00:08:40.360 | at least the documents I have seen, right?
00:08:43.440 | So from that perspective, I think, yes,
00:08:45.920 | in terms of how far are we from the original vision?
00:08:50.760 | I can't reveal that,
00:08:51.840 | but that's why I have a ton of fun at work
00:08:54.320 | because every day we go in and thinking like,
00:08:57.000 | these are the new set of challenges to solve.
00:08:58.840 | - Yeah, it's a great way to do great engineering
00:09:01.680 | as you think of the press release.
00:09:03.400 | I like that idea, actually.
00:09:04.800 | Maybe we'll talk about it a bit later,
00:09:06.600 | but it's just a super nice way to have a focus.
00:09:09.040 | - I'll tell you this, you're a scientist,
00:09:11.120 | and a lot of my scientists have adopted that.
00:09:13.520 | They have now, they love it as a process
00:09:16.760 | because it was very, as scientists,
00:09:18.760 | you're trained to write great papers,
00:09:20.720 | but they are all after you've done the research
00:09:23.280 | or you've proven, and your PhD dissertation proposal
00:09:26.400 | is something that comes closest,
00:09:28.240 | or a DARPA proposal or a NSF proposal
00:09:30.960 | is the closest that comes to a press release.
00:09:33.400 | But that process is now ingrained in our scientists,
00:09:36.800 | which is delightful for me to see.
00:09:39.600 | - You write the paper first and then make it happen.
00:09:42.840 | - That's right.
00:09:43.680 | In fact, it's not- - State of the art results.
00:09:46.080 | - Or you leave the results section open,
00:09:48.240 | but you have a thesis about, here's what I expect, right?
00:09:51.440 | And here's what it will change, right?
00:09:54.760 | So I think it is a great thing.
00:09:56.320 | It works for researchers as well.
00:09:58.000 | - Yeah.
00:09:58.840 | So far-field recognition, what was the big leap?
00:10:03.680 | What were the breakthroughs?
00:10:05.240 | And what was that journey like to today?
00:10:08.200 | - Yeah, I think the, as you said,
00:10:09.720 | first there was a lot of skepticism
00:10:11.400 | on whether far-field speech recognition
00:10:13.160 | will ever work to be good enough, right?
00:10:16.320 | And what we first did was got a lot of training data
00:10:19.800 | in a far-field setting.
00:10:21.280 | And that was extremely hard to get
00:10:23.840 | because none of it existed.
00:10:26.000 | So how do you collect data in far-field setup, right?
00:10:29.880 | - With no customer base at this time.
00:10:31.200 | - With no customer base, right?
00:10:32.480 | So that was first innovation.
00:10:34.600 | And once we had that, the next thing was,
00:10:36.800 | okay, if you have the data, first of all,
00:10:40.560 | we didn't talk about like, what would magical mean
00:10:43.680 | in this kind of a setting?
00:10:45.080 | What is good enough for customers, right?
00:10:47.320 | That's always, since you've never done this before,
00:10:50.280 | what would be magical?
00:10:51.400 | So it wasn't just a research problem.
00:10:54.000 | You had to put some, in terms of accuracy
00:10:57.440 | and customer experience features,
00:10:59.680 | some stakes on the ground saying,
00:11:01.280 | here's where I think it should get to.
00:11:04.720 | So you established a bar.
00:11:05.800 | And then how do you measure progress
00:11:07.240 | toward it, given you have no customers right now?
00:11:11.520 | So from that perspective, we went,
00:11:14.000 | so first was the data without customers.
00:11:17.320 | Second was doubling down on deep learning
00:11:20.320 | as a way to learn.
00:11:21.680 | And I can just tell you that the combination of the two
00:11:25.920 | got our error rates down by a factor of five.
00:11:28.960 | From where we were when I started to,
00:11:32.040 | within six months of having that data,
00:11:34.920 | at that point, I got the conviction that this will work.
00:11:39.400 | Right, so because that was magical
00:11:41.400 | in terms of when it started working.
00:11:43.600 | And--
00:11:44.440 | - That reached the magical--
00:11:45.640 | - That came close to the magical bar.
00:11:47.480 | - That to the bar, right?
00:11:49.320 | That we felt would be where people will use it,
00:11:54.080 | which was critical.
00:11:55.160 | Because you really have one chance at this.
00:11:58.680 | If we had launched in November 2014 is when we launched,
00:12:01.720 | if it was below the bar,
00:12:02.920 | I don't think this category exists
00:12:06.320 | if you don't meet the bar.
00:12:07.880 | - Yeah, and just having looked at voice-based interactions,
00:12:11.840 | like in the car, earlier systems,
00:12:15.800 | it's a source of huge frustration for people.
00:12:18.120 | In fact, we use voice-based interaction
00:12:20.120 | for collecting data on subjects to measure frustration.
00:12:24.440 | So as a training set for computer vision, for face data,
00:12:28.080 | so we can get a data set of frustrated people.
00:12:30.440 | That's the best way to get frustrated people
00:12:32.080 | is having them interact with a voice-based system in the car.
00:12:35.400 | So that bar, I imagine, is pretty high.
00:12:38.360 | - It was very high,
00:12:39.320 | and we talked about how also errors are perceived
00:12:42.520 | from AIs versus errors by humans.
00:12:45.200 | But we are not done with the problems that ended up,
00:12:49.680 | we had to solve to get it to launch.
00:12:51.000 | So do you want the next one?
00:12:52.440 | - Yeah, the next one.
00:12:53.480 | - So the next one was what I think of
00:12:59.120 | as multi-domain natural language understanding.
00:13:02.280 | It's very, I wouldn't say easy,
00:13:04.520 | but it is during those days,
00:13:08.800 | solving it, understanding in one domain,
00:13:11.120 | a narrow domain, was doable.
00:13:13.800 | But for these multiple domains like music,
00:13:18.400 | like information, other kinds of household productivity,
00:13:22.200 | alarms, timers, even though it wasn't as big as it is
00:13:25.640 | in terms of the number of skills Alexa has
00:13:27.280 | and the confusion space has grown by
00:13:30.240 | three orders of magnitude,
00:13:32.280 | it was still daunting even those days.
00:13:34.120 | - Again, no customer base yet.
00:13:36.200 | - Again, no customer base.
00:13:37.800 | So now you're looking at meaning understanding
00:13:39.680 | and intent understanding and taking actions
00:13:41.680 | on behalf of customers based on their requests.
00:13:44.920 | And that is the next hard problem.
00:13:47.760 | Even if you have gotten the words recognized,
00:13:51.280 | how do you make sense of them?
00:13:52.840 | In those days, there was still a lot of emphasis
00:13:58.760 | on rule-based systems for writing grammar patterns
00:14:02.160 | to understand the intent,
00:14:03.720 | but we had a statistical first approach even then,
00:14:07.000 | where for a language understanding we had,
00:14:09.840 | even those starting days,
00:14:11.160 | an entity recognizer and an intent classifier,
00:14:15.280 | which was all trained statistically.
00:14:17.960 | In fact, we had to build the deterministic matching
00:14:21.200 | as a follow-up to fix bugs that statistical models have.
00:14:26.040 | So it was just a different mindset
00:14:28.040 | where we focused on data-driven statistical understanding.
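The "statistical first, deterministic matching as a follow-up" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system discussed in the conversation: the training utterances, the intent names (`PlayMusic`, `SetTimer`, `SetAlarm`, `Stop`), the `OVERRIDES` table, and the choice of a naive Bayes classifier are all hypothetical, and the entity recognizer mentioned alongside the intent classifier is left out.

```python
import math
from collections import Counter, defaultdict

# Toy training data; utterances and intent names are made up for illustration.
TRAINING = [
    ("play some music", "PlayMusic"),
    ("play songs by the stones", "PlayMusic"),
    ("set a timer for ten minutes", "SetTimer"),
    ("set an alarm for six", "SetAlarm"),
    ("wake me up at six", "SetAlarm"),
]

# Hypothetical deterministic overrides, added after the fact to patch
# utterances the statistical model gets wrong or has never seen.
OVERRIDES = {"stop": "Stop"}


def train(examples):
    """Count words per intent for a multinomial naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, intent_counts, vocab


def classify(text, model):
    """Return an exact-match override if present, else the most probable intent."""
    if text in OVERRIDES:
        return OVERRIDES[text]
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent, n in intent_counts.items():
        score = math.log(n / total)  # log prior of the intent
        denom = sum(word_counts[intent].values()) + len(vocab)
        for w in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[intent][w] + 1) / denom)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(TRAINING)
print(classify("play the stones", model))
print(classify("stop", model))
```

In this sketch the override table is consulted first, so a known failure can be patched without retraining, while everything else goes through the data-driven model.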
00:14:31.840 | - Wins in the end if you have a huge dataset.
00:14:34.520 | - Yes, it is contingent on that.
00:14:36.240 | And that's why it came back to how do you get the data.
00:14:38.920 | Before customers, the fact that this is why data
00:14:42.280 | becomes crucial to get to the point
00:14:45.120 | that you have the understanding system built up.
00:14:49.920 | And notice that for you,
00:14:52.560 | we were talking about human-machine dialogue,
00:14:54.320 | even those early days, even it was very much transactional,
00:14:59.080 | do one thing, one-shot utterances, in a great way.
00:15:02.360 | There was a lot of debate on how much should Alexa talk back
00:15:04.680 | in terms of if it misunderstood you
00:15:07.240 | or you said play songs by the Stones.
00:15:11.320 | And let's say it doesn't know, early days,
00:15:14.600 | knowledge can be sparse.
00:15:16.880 | Who are the Stones?
00:15:18.040 | The Rolling Stones.
00:15:20.160 | And you don't want the match to be Stone Temple Pilots
00:15:26.160 | or Rolling Stones.
00:15:27.080 | So you don't know which one it is.
00:15:28.720 | So these kind of other signals to...
00:15:32.320 | And now there we had great assets from Amazon
00:15:35.760 | in terms of...
00:15:36.880 | - UX, like what is it?
00:15:38.240 | What kind of...
00:15:39.400 | Yeah, how do you solve that problem?
00:15:41.080 | - In terms of what we think of it
00:15:41.920 | as an entity resolution problem, right?
00:15:43.840 | So because which one is it, right?
00:15:46.040 | I mean, even if you figured out the Stones is an entity,
00:15:50.000 | you have to resolve it to whether it's the Stones
00:15:52.040 | or the Stone Temple Pilots or some other Stones.
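The entity resolution problem described above (deciding which catalog entry "the Stones" refers to) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `CATALOG` entries and their popularity numbers are made up, and scoring by token overlap times a popularity prior is just one simple way to rank candidates, not the method used in the product being discussed.

```python
# Hypothetical catalog: artist name -> made-up popularity prior in [0, 1].
CATALOG = {
    "The Rolling Stones": 0.9,
    "Stone Temple Pilots": 0.6,
    "Angus & Julia Stone": 0.3,
}

STOPWORDS = {"the", "a", "an", "&"}


def stem(token):
    """Crude plural stripping so 'stones' and 'stone' compare equal."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def tokens(text):
    """Lowercase, drop stopwords, and stem the remaining tokens."""
    return {stem(t) for t in text.lower().split() if t not in STOPWORDS}


def resolve(mention, catalog=CATALOG, margin=0.1):
    """Rank candidates by token overlap (Jaccard) times popularity.

    Returns (best_name, needs_clarification). The flag is True when nothing
    matches or when the top two candidates score within `margin` of each
    other, which is where a follow-up question to the user would make sense.
    """
    m = tokens(mention)
    scored = []
    for name, popularity in catalog.items():
        c = tokens(name)
        union = m | c
        overlap = len(m & c) / len(union) if union else 0.0
        scored.append((overlap * popularity, name))
    scored.sort(reverse=True)
    (best_score, best_name), (second_score, _) = scored[0], scored[1]
    needs_clarification = best_score == 0 or best_score - second_score < margin
    return best_name, needs_clarification


print(resolve("the stones"))
print(resolve("the beatles"))
```

The `needs_clarification` flag reflects the split discussed next in the conversation: most requests should resolve without further questions, and only the ambiguous remainder should trigger a clarifying prompt.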
00:15:54.680 | - Maybe I misunderstood.
00:15:55.600 | Is the resolution the job of the algorithm
00:15:58.160 | or is the job of UX communicating with the human
00:16:01.520 | to help the resolution?
00:16:02.360 | - Well, there is both, right?
00:16:04.120 | It is...
00:16:05.280 | You want 90% or high 90s to be done
00:16:08.640 | without any further questioning or UX, right?
00:16:11.200 | But it's absolutely okay.
00:16:14.040 | Just like as humans, we ask the question,
00:16:16.760 | I didn't understand you, Alex.
00:16:18.840 | It's fine for Alexa to occasionally say,
00:16:20.520 | I did not understand you, right?
00:16:21.960 | And that's an important way to learn.
00:16:24.520 | And I'll talk about where we have come
00:16:26.080 | with more self-learning with these kind of feedback signals.
00:16:29.960 | But in those days, just solving the ability
00:16:33.120 | of understanding the intent and resolving to an action
00:16:36.360 | where action could be play a particular artist
00:16:38.600 | or a particular song was super hard.
00:16:41.800 | Again, the bar was high as you're talking about, right?
00:16:45.280 | So while we launched it in sort of 13 big domains,
00:16:50.120 | I would say, in terms of,
00:16:51.280 | or we think of it as 13, the big skills we had,
00:16:54.640 | like music is a massive one when we launched it.
00:16:57.600 | And now we have 90,000 plus skills on Alexa.