
Rohit Prasad: Solving Far-Field Speech Recognition and Intent Understanding | AI Podcast Clips


Transcript

The inspiration was the Star Trek computer. So when you think of it that way, you know, everything is possible, but when you launch a product, you have to start with some place. And when I joined, the product was already in conception and we started working on the far field speech recognition because that was the first thing to solve.

By that, we mean that you should be able to speak to the device from a distance. In those days, that wasn't a common practice, and even in the previous research world I was in, it was considered an unsolvable problem in terms of whether you can converse from a distance.

And here I'm still talking about the first part of the problem where you say, get the attention of the device, as in by saying what we call the wake word, which means the word Alexa has to be detected with a very high accuracy because it is a very common word.

It has sound units that map to words like "I like you" or "Alec," "Alex," right? So it's an undoubtedly hard problem to detect the right mentions of Alexa addressed to the device versus "I like Alexa." - So you have to pick up that signal when there's a lot of noise.

- Not only noise, but a lot of conversation in the house. Remember on the device, you're simply listening for the wake word, Alexa. And there's a lot of words being spoken in the house. How do you know it's Alexa and directed at Alexa? Because I could say, I love my Alexa, I hate my Alexa, I want Alexa to do this.

And in all these three sentences I said Alexa, I didn't want it to wake up. - Can I just pause on that for a second? What advice should I probably give to people in the introduction of this conversation, in terms of turning off their Alexa device if they're listening to this podcast conversation out loud?

Like what's the probability that an Alexa device will go off? Because we mentioned Alexa like a million times. - So it will. We have done a lot of different things where we can figure out whether the speech is coming from a human versus over the air.

Also, think about ads. We also launched a technology for watermarking kinds of approaches, in terms of filtering those out. But yes, if this kind of a podcast is happening, it's possible your device will wake up a few times.

It's an unsolved problem, but it is definitely something we care very much about. - But the idea is you want to detect Alexa... - Meant for the device. - First of all, just even hearing Alexa versus "I like something," I mean, that's a fascinating part. So that was the first...

- That's the first part. - The world's best detector of Alexa. - Yeah, the world's best wake word detector in a far field setting, not like something where the phone is sitting on the table. This is like people have devices 40 feet away, like in my house or 20 feet away, and you still get an answer.
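To make the wake-word problem described above concrete, here is a minimal, hypothetical sketch of sliding-window keyword spotting in Python: score short windows of acoustic features with a classifier and fire only when the score clears a high threshold. The frame sizes, threshold, and the placeholder score_window function are illustrative assumptions, not Amazon's actual detector.

```python
import numpy as np

# Hypothetical sketch of wake-word spotting: slide a window over acoustic
# feature frames, score each window with a classifier, and fire only when
# the score clears a high threshold, since false wakes ("I like Alexa")
# are very costly. The frame sizes and threshold are illustrative.

FRAME_MS = 10          # one feature vector per 10 ms of audio
WINDOW_FRAMES = 100    # ~1 second window, roughly the length of "Alexa"
THRESHOLD = 0.95       # deliberately high to reject incidental mentions

def score_window(window: np.ndarray) -> float:
    """Placeholder for a trained wake-word classifier.

    Takes a (WINDOW_FRAMES, feature_dim) block of features and returns
    P(wake word | window). A real detector would be a small neural net.
    """
    return float(np.clip(window.mean(), 0.0, 1.0))

def detect_wake_word(features: np.ndarray) -> list:
    """Return start-frame indices where the wake word is detected."""
    hits = []
    for start in range(0, len(features) - WINDOW_FRAMES, WINDOW_FRAMES // 4):
        if score_window(features[start:start + WINDOW_FRAMES]) >= THRESHOLD:
            hits.append(start)
    return hits

# Example with random stand-in features; a real system would use log-mel frames.
print(detect_wake_word(np.random.rand(1000, 40)))  # likely [] with random input
```

The "I love my Alexa" examples in the conversation correspond to keeping this detection threshold high and adding downstream checks on whether the utterance is actually directed at the device.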

So that was the first part. The next is, okay, you're speaking to the device. Of course, you're gonna issue many different requests. Some may be simple, some may be extremely hard, but it's a large vocabulary, speech recognition problem essentially, where the audio is now not coming onto your phone or a handheld mic like this or a close talking mic, but it's from 20 feet away where if you're in a busy household, your son may be listening to music, your daughter may be running around with something and asking your mom something and so forth, right?

So this is like a common household setting where the words you're speaking to Alexa need to be recognized with very high accuracy, right? Now we are still just in the recognition problem. We haven't yet come to the understanding one, right? - And if I pause, I'm sorry, once again, what year was this?

Is this before neural networks began to seriously prove themselves in the audio space? - Yeah, this is around... so I joined in 2013, in April, right? So the early research on neural networks coming back and showing some promising results in the speech recognition space had started happening, but it was very early.

But just to build on that: the very first thing we did when I joined the team, and remember it was very much a startup environment, which is great about Amazon, was that we doubled down on deep learning right away. And we knew we'd have to improve accuracy fast.

Because of that, we worked on the scale of data. Once you have a device like this, if it is successful, the scale of data will improve big time. Like you'll suddenly have large volumes of data to learn from to make the customer experience better. So how do you scale deep learning?

So we did one of the first works in training with distributed GPUs, where the training time was linear in the amount of data. That was quite important work, with algorithmic improvements as well as a lot of engineering improvements, to be able to train on thousands and thousands of hours of speech.
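As a rough illustration of the "training time linear in the amount of data" point, here is a toy sketch of synchronous data-parallel training: each worker computes a gradient on its own data shard, and the gradients are averaged before the shared weights are updated. The model, data, and hyperparameters are invented for illustration; real systems replace the Python loop with an all-reduce across GPUs.

```python
import numpy as np

# Toy sketch of synchronous data-parallel training: each "worker" holds a
# shard of the data and computes a gradient on it; the gradients are
# averaged (an all-reduce on real GPU clusters) before one shared weight
# update. Doubling the data while doubling the workers keeps per-step work
# per worker constant, which is how training time stays manageable as the
# amount of data grows. Model, data, and hyperparameters are invented.

def local_gradient(w, X, y):
    """Least-squares gradient computed on one worker's shard."""
    return 2.0 * X.T @ (X @ w - y) / len(X)

def data_parallel_sgd(shards, dim, lr=0.05, steps=300):
    w = np.zeros(dim)
    for _ in range(steps):
        grads = [local_gradient(w, X, y) for X, y in shards]  # parallel on real hardware
        w -= lr * np.mean(grads, axis=0)                      # "all-reduce" + update
    return w

# Four workers, each with its own shard of synthetic data.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
shards = []
for _ in range(4):
    X = rng.normal(size=(256, 5))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=256)))

w_hat = data_parallel_sgd(shards, dim=5)
print(np.allclose(w_hat, true_w, atol=0.01))  # True: recovers the true weights
```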

And that was an important factor. So if you ask me, back in 2013 and 2014 when we launched Echo, the combination of large-scale data, deep learning progress, and the near-infinite GPUs we had available on AWS even then all came together for us to be able to solve far-field speech recognition to the extent it could be useful to customers.

It's still not solved. Like, I mean, it's not that we are perfect at recognizing speech, but we are great at it in terms of the settings that are in homes, right? And that was important even in the early stages. - So first of all, I'm trying to look back at that time.

If I remember correctly, it seems like the task would be pretty daunting. So we kind of take it for granted that it works now. - Yes, you're right. - So let me ask: first of all, you mentioned startup. I wasn't familiar with how big the team was.

I kind of, 'cause I know there's a lot of really smart people working on it. So now it's a very, very large team. How big was the team? How likely were you to fail in the eyes of everyone else? (laughing) - And ourselves? (laughing) - And yourself? So like what?

- I'll give you a very interesting anecdote on that. When I joined the team, the speech recognition team was six people. My first meeting, and we had hired a few more people, it was 10 people. Nine out of 10 people thought it can't be done. Right? - Who was the one?

(laughing) - The one was me. Actually, I should say, one was semi-optimistic. - Yeah. - And eight were trying to convince me, "Let's go to the management and say let's not work on this problem. Let's work on some other problem, like telephony speech for customer service calls," and so forth.

But this was the kind of belief you must have. And I had experience with far field speech recognition, and my eyes lit up when I saw a problem like that, saying, "Okay, we have been in speech recognition "always looking for that killer app." - Yeah. - And this was a killer use case to bring something delightful in the hands of customers.

- You mentioned the way you kind of think of it in a product way: in the future, you have a press release and an FAQ, and you think backwards. - That's right. - Did you have, did the team have the Echo in mind? So this far-field speech recognition, actually putting a thing in the home that works, that you're able to interact with, was that the press release?

What was the-- - It was very close, I would say. As I said, the vision, or the inspiration, was the Star Trek computer, right? And from there, I can't divulge all the exact specifications, but one of the first things that was magical on Alexa was music.

It brought me back to music, because my taste was still where it was when I was an undergrad. So I still listened to those songs, and it was too hard for me to be a music fan with a phone. Right, and I hate things in my ear.

So from that perspective, it was quite hard, and music was part of it, at least in the documents I have seen, right? So from that perspective, I think, yes. In terms of how far are we from the original vision, I can't reveal that, but that's why I have a ton of fun at work, because every day we go in thinking, these are the new set of challenges to solve.

- Yeah, it's a great way to do great engineering as you think of the press release. I like that idea, actually. Maybe we'll talk about it a bit later, but it's just a super nice way to have a focus. - I'll tell you this, you're a scientist, and a lot of my scientists have adopted that.

They love it now as a process, because as scientists, you're trained to write great papers, but those are all written after you've done the research or you've proven it. Your PhD dissertation proposal, or a DARPA proposal or an NSF proposal, is the closest thing that comes to a press release.

But that process is now ingrained in our scientists, which is delightful for me to see. - You write the paper first and then make it happen. - That's right. In fact, it's not- - State of the art results. - Or you leave the results section open, but you have a thesis about, here's what I expect, right?

And here's what it will change, right? So I think it is a great thing. It works for researchers as well. - Yeah. So far-field recognition, what was the big leap? What were the breakthroughs? And what was that journey like to today? - Yeah, I think the, as you said, first there was a lot of skepticism on whether far-field speech recognition will ever work to be good enough, right?

And what we first did was got a lot of training data in a far-field setting. And that was extremely hard to get because none of it existed. So how do you collect data in far-field setup, right? - With no customer base at this time. - With no customer base, right?

So that was the first innovation. And once we had that, the next thing was, okay, if you have the data... First of all, we didn't talk about what magical would mean in this kind of a setting. What is good enough for customers, right? That's always the question: since you've never done this before, what would be magical?

So it wasn't just a research problem. You had to put some stakes in the ground, in terms of accuracy and customer experience features, saying, here's where I think it should get to. So you established a bar. And then, how do you measure progress toward it, given you have no customers right now?

So from that perspective, first was the data without customers. Second was doubling down on deep learning as a way to learn. And I can just tell you that the combination of the two brought our error rates down by a factor of five from where we were when I started, within six months of having that data. At that point, I got the conviction that this would work.
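The bar and the factor-of-five improvement he mentions are typically tracked with word error rate. Below is a standard edit-distance WER computation; the example sentences and numbers are illustrative, not Amazon's internal figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed with the usual Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative numbers only: a factor-of-five improvement would be, say,
# going from roughly 33% WER (two errors in six words, as below) to ~7%.
print(word_error_rate("play songs by the rolling stones",
                      "play some by the rolling stone"))  # 2 / 6 = 0.33
```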

Right, because that was magical in terms of when it started working. And-- - That reached the magical-- - That came close to the magical bar. - To the bar, right? That we felt would be where people would use it, which was critical. Because you really have one chance at this.

We launched in November 2014, and if it was below the bar, I don't think this category exists if you don't meet the bar. - Yeah, and just having looked at voice-based interactions, like in the car with earlier systems, it's a source of huge frustration for people.

In fact, we use voice-based interaction for collecting data on subjects to measure frustration, as a training set for computer vision, for face data, so we can get a data set of frustrated people. The best way to get frustrated people is to have them interact with a voice-based system in the car.

So that bar, I imagine, is pretty high. - It was very high, and we talked about how errors are perceived differently from AIs versus from humans. But we are not done with the problems we ended up having to solve to get it to launch. So do you want the next one?

- Yeah, the next one. - So the next one was what I think of as multi-domain natural language understanding. I wouldn't say it was easy, but in those days, solving understanding in one domain, a narrow domain, was doable. But for these multiple domains, like music, information, other kinds of household productivity, alarms, timers, it was still daunting even in those days, even though it wasn't as big as it is now in terms of the number of skills Alexa has, and the confusion space has grown by three orders of magnitude since.

- Again, no customer base yet. - Again, no customer base. So now you're looking at meaning understanding and intent understanding and taking actions on behalf of customers based on their requests. And that is the next hard problem. Even if you have gotten the words recognized, how do you make sense of them?

In those days, there was still a lot of emphasis on rule-based systems, on writing grammar patterns to understand the intent, but we had a statistical-first approach even then, where for language understanding we had, even in those starting days, an entity recognizer and an intent classifier, which were all trained statistically.

In fact, we had to build the deterministic matching as a follow-up to fix bugs that statistical models have. So it was just a different mindset where we focused on data-driven statistical understanding. - Wins in the end if you have a huge dataset. - Yes, it is contingent on that.
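A minimal sketch of the statistical split he describes, an intent classifier plus an entity recognizer, assuming scikit-learn for the intent model and a toy gazetteer standing in for a trained slot tagger. The utterances, intent labels, and artist list are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: each utterance is labeled with an intent.
utterances = [
    "play songs by the rolling stones", "play some jazz",
    "set an alarm for seven am", "wake me up at six",
    "what is the weather today", "will it rain tomorrow",
]
intents = ["PlayMusic", "PlayMusic", "SetAlarm", "SetAlarm", "GetWeather", "GetWeather"]

# Statistical intent classifier: bag-of-words features + logistic regression.
intent_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_model.fit(utterances, intents)

# Toy entity recognizer: a gazetteer lookup standing in for a trained
# sequence tagger (CRF or neural) over slot types such as ArtistName.
ARTISTS = {"the rolling stones", "stone temple pilots"}

def recognize_entities(text: str) -> dict:
    found = {}
    for artist in ARTISTS:
        if artist in text.lower():
            found["ArtistName"] = artist
    return found

query = "play songs by the rolling stones"
print(intent_model.predict([query])[0])  # PlayMusic
print(recognize_entities(query))         # {'ArtistName': 'the rolling stones'}
```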

And that's why it came back to how you get the data before customers. This is why data becomes crucial to get to the point where you have the understanding system built up. And notice that we were talking about human-machine dialogue even in those early days, even though it was very much transactional: do one thing, one-shot utterances, done in a great way.

There was a lot of debate on how much Alexa should talk back, in terms of if it misunderstood you, or you said play songs by the Stones, and let's say it doesn't know; in the early days, knowledge can be sparse. Who are the Stones? The Rolling Stones. And you don't want the match to be Stone Temple Pilots when it should be the Rolling Stones.

So you don't know which one it is. So you need these kinds of other signals to... And there we had great assets from Amazon in terms of... - UX, like what is it? What kind of... Yeah, how do you solve that problem? - In terms of what we think of as an entity resolution problem, right?

Because which one is it, right? I mean, even if you figured out that the Stones is an entity, you have to resolve it to whether it's the Rolling Stones or the Stone Temple Pilots or some other Stones. - Maybe I misunderstood. Is the resolution the job of the algorithm, or is it the job of UX communicating with the human to help the resolution?

- Well, it is both, right? You want 90%, or high 90s, to be done without any further questioning or UX, right? But it's absolutely okay, just like as humans, we ask the question, I didn't understand you. It's fine for Alexa to occasionally say, I did not understand you, right?

And that's an important way to learn. And I'll talk about where we have come with more self-learning with these kinds of feedback signals. But in those days, just solving the ability to understand the intent and resolve it to an action, where the action could be play a particular artist or a particular song, was super hard.
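A minimal sketch of the entity resolution plus clarification behavior being described: score candidate catalog entries against the recognized mention, resolve silently when the top candidate is confident enough, and ask a clarifying question otherwise. The string-similarity scorer and the threshold are stand-in assumptions, not Alexa's actual method.

```python
from difflib import SequenceMatcher

# Invented catalog of candidate entities the mention could resolve to.
CATALOG = ["The Rolling Stones", "Stone Temple Pilots", "Stone Sour"]
RESOLVE_THRESHOLD = 0.8  # hypothetical: below this, ask the user instead of guessing

def similarity(mention: str, candidate: str) -> float:
    """Stand-in scorer; a real resolver would also weigh popularity,
    the user's listening history, and other signals."""
    return SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()

def resolve(mention: str) -> str:
    scored = sorted(((similarity(mention, c), c) for c in CATALOG), reverse=True)
    best_score, best = scored[0]
    if best_score >= RESOLVE_THRESHOLD:
        return f"Playing {best}"
    return f"Did you mean {best} or {scored[1][1]}?"

print(resolve("the rolling stones"))  # confident match: resolves silently
print(resolve("the stones"))          # ambiguous: falls back to a clarifying question
```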

Again, the bar was high, as you were saying, right? So we launched it with, I would say, sort of 13 big domains, or we think of them as the 13 big skills we had; music is a massive one from when we launched it. And now we have 90,000-plus skills on Alexa.

(upbeat music)