
Rohit Prasad: Amazon Alexa and Conversational AI | Lex Fridman Podcast #57


Chapters

19:46 How Do the Conversations Evolve
48:15 Is Alexa Listening
52:32 Follow-Up Mode
53:02 Alexa Guard
54:09 History of Alexa
58:11 Speech Recognition
59:51 Scale Deep Learning
67:48 Multi-Domain Natural Language Understanding
70:32 Entity Resolution
72:27 Echo Plug
73:45 Alexa Conversations
79:57 Self Learning
87:39 Challenges
91:59 Transfer Learning
105:33 Words of Wisdom

Whisper Transcript

00:00:00.000 | The following is a conversation with Rohit Prasad.
00:00:02.960 | He's the vice president and head scientist of Amazon Alexa
00:00:06.360 | and one of its original creators.
00:00:08.880 | The Alexa team embodies some of the most challenging,
00:00:12.120 | incredible, impactful, and inspiring work
00:00:14.960 | that is done in AI today.
00:00:17.040 | The team has to both solve problems
00:00:19.120 | at the cutting edge of natural language processing
00:00:21.720 | and provide a trustworthy, secure,
00:00:24.000 | and enjoyable experience to millions of people.
00:00:27.440 | This is where state-of-the-art methods in computer science
00:00:30.800 | meet the challenges of real-world engineering.
00:00:33.720 | In many ways, Alexa and the other voice assistants
00:00:37.280 | are the voices of artificial intelligence
00:00:39.480 | to millions of people and an introduction to AI
00:00:43.160 | for people who have only encountered it in science fiction.
00:00:46.920 | This is an important and exciting opportunity.
00:00:49.960 | And so the work that Rohit and the Alexa team are doing
00:00:52.920 | is an inspiration to me and to many researchers
00:00:55.920 | and engineers in the AI community.
00:00:58.840 | This is the Artificial Intelligence Podcast.
00:01:01.920 | If you enjoy it, subscribe on YouTube,
00:01:04.400 | give it five stars on Apple Podcasts,
00:01:06.360 | support it on Patreon,
00:01:07.720 | or simply connect with me on Twitter,
00:01:09.800 | @lexfridman, spelled F-R-I-D-M-A-N.
00:01:13.680 | If you leave a review on Apple Podcasts especially,
00:01:16.920 | but also on Castbox, or comment on YouTube,
00:01:20.000 | consider mentioning topics, people, ideas, questions, quotes
00:01:23.600 | in science, tech, or philosophy that you find interesting.
00:01:26.400 | And I'll read them on this podcast.
00:01:28.760 | I won't call out names, but I love comments
00:01:31.600 | with kindness and thoughtfulness in them,
00:01:33.200 | so I thought I'd share them.
00:01:35.680 | Someone on YouTube highlighted a quote
00:01:37.440 | from the conversation with Ray Dalio,
00:01:40.240 | where he said that you have to appreciate
00:01:41.920 | all the different ways that people can be A players.
00:01:45.240 | This connected with me too.
00:01:46.960 | On teams of engineers, it's easy to think
00:01:49.200 | that raw productivity is the measure of excellence,
00:01:51.920 | but there are others.
00:01:53.400 | I've worked with people who brought a smile to my face
00:01:55.680 | every time I got to work in the morning.
00:01:57.880 | Their contribution to the team is immeasurable.
00:02:01.200 | I recently started doing podcast ads
00:02:03.000 | at the end of the introduction.
00:02:04.600 | I'll do one or two minutes after introducing the episode
00:02:07.600 | and never any ads in the middle
00:02:09.080 | that break the flow of the conversation.
00:02:11.480 | I hope that works for you.
00:02:12.960 | It doesn't hurt the listening experience.
00:02:15.600 | This show is presented by Cash App,
00:02:17.800 | the number one finance app in the App Store.
00:02:20.280 | I personally use Cash App to send money to friends,
00:02:22.960 | but you can also use it to buy, sell,
00:02:24.680 | and deposit Bitcoin in just seconds.
00:02:27.120 | Cash App also has a new investing feature.
00:02:30.320 | You can buy fractions of a stock,
00:02:32.040 | say $1 worth, no matter what the stock price is.
00:02:35.760 | Brokerage services are provided by Cash App Investing,
00:02:38.640 | a subsidiary of Square and member of SIPC.
00:02:42.360 | I'm excited to be working with Cash App
00:02:44.400 | to support one of my favorite organizations called FIRST,
00:02:47.520 | best known for their FIRST Robotics and Lego competitions.
00:02:50.880 | They educate and inspire hundreds of thousands of students
00:02:54.320 | in over 110 countries
00:02:56.240 | and have a perfect rating on Charity Navigator,
00:02:58.800 | which means the donated money is used
00:03:00.880 | to maximum effectiveness.
00:03:03.440 | When you get Cash App from the App Store, Google Play,
00:03:06.360 | and use code LEXPODCAST, you'll get $10,
00:03:10.240 | and Cash App will also donate $10 to FIRST,
00:03:13.240 | which again is an organization
00:03:15.080 | that I've personally seen inspire girls and boys
00:03:18.120 | to dream of engineering a better world.
00:03:20.880 | This podcast is also supported by ZipRecruiter.
00:03:25.560 | Hiring great people is hard,
00:03:27.720 | and to me is one of the most important elements
00:03:30.440 | of a successful mission-driven team.
00:03:32.800 | I've been fortunate to be a part of
00:03:34.760 | and lead several great engineering teams.
00:03:37.400 | The hiring I've done in the past
00:03:39.000 | was mostly through tools we built ourselves,
00:03:41.920 | but reinventing the wheel was painful.
00:03:44.160 | ZipRecruiter is a tool that's already available for you.
00:03:47.240 | It seeks to make hiring simple, fast, and smart.
00:03:50.720 | For example, Codable co-founder Gretchen Huebner
00:03:54.080 | used ZipRecruiter to find a new game artist
00:03:56.360 | to join her education tech company.
00:03:58.640 | By using ZipRecruiter's screening questions
00:04:00.720 | to filter candidates,
00:04:02.160 | Gretchen found it easier to focus on the best candidates,
00:04:05.080 | and finally hired the perfect person for the role
00:04:08.160 | in less than two weeks from start to finish.
00:04:11.480 | ZipRecruiter, the smartest way to hire.
00:04:15.000 | See why ZipRecruiter is effective
00:04:16.640 | for businesses of all sizes by signing up, as I did,
00:04:20.280 | for free at ziprecruiter.com/lexpod.
00:04:24.600 | That's ziprecruiter.com/lexpod.
00:04:28.360 | And now, here's my conversation with Rohit Prasad.
00:04:33.400 | In the movie "Her," I'm not sure if you've ever seen it,
00:04:37.600 | human falls in love with the voice of an AI system.
00:04:41.200 | Let's start at the highest philosophical level
00:04:43.400 | before we get to deep learning and some of the fun things.
00:04:46.600 | Do you think this, what the movie "Her" shows,
00:04:49.400 | is within our reach?
00:04:50.560 | - I think, not specifically about "Her,"
00:04:55.560 | but I think what we are seeing is a massive increase
00:05:00.240 | in adoption of AI assistants, or AI,
00:05:03.440 | in all parts of our social fabric.
00:05:06.760 | And I think it's, what I do believe
00:05:09.960 | is that the utility these AIs provide
00:05:13.600 | and some of the functionalities that are shown
00:05:16.720 | are absolutely within reach.
00:05:18.520 | - So some of the functionality
00:05:21.680 | in terms of the interactive elements,
00:05:23.760 | but in terms of the deep connection
00:05:26.800 | that's purely voice-based,
00:05:28.920 | do you think such a close connection is possible
00:05:31.320 | with voice alone?
00:05:32.760 | - It's been a while since I saw "Her,"
00:05:34.400 | but I would say in terms of interactions
00:05:38.960 | which are both human-like, and in these AI assistants,
00:05:42.120 | you have to value what is also superhuman.
00:05:45.520 | We as humans can be in only one place.
00:05:49.400 | AI assistants can be in multiple places at the same time,
00:05:52.880 | one with you on your mobile device,
00:05:55.320 | one at your home, one at work.
00:05:58.000 | So you have to respect these superhuman capabilities too.
00:06:00.960 | Plus, as humans, we have certain attributes
00:06:04.720 | we're very good at, very good at reasoning.
00:06:06.720 | AI assistants, not yet there,
00:06:08.920 | but in the realm of AI assistants,
00:06:11.640 | what they're great at is computation, memory.
00:06:14.000 | It's infinite and pure.
00:06:15.960 | These are the attributes you have to start respecting.
00:06:17.760 | So I think the comparison with human-like
00:06:19.680 | versus the other aspect, which is also superhuman,
00:06:22.800 | has to be taken into consideration.
00:06:24.240 | So I think we need to elevate the discussion
00:06:26.760 | to not just human-like.
00:06:28.560 | - So there's certainly elements where you just mentioned,
00:06:31.800 | Alexa is everywhere, computationally speaking.
00:06:35.240 | So this is a much bigger infrastructure
00:06:37.040 | than just the thing that sits there in the room with you.
00:06:39.960 | But it certainly feels, to us mere humans,
00:06:44.560 | that there's just another little creature there
00:06:49.000 | when you're interacting with it.
00:06:49.920 | You're not interacting with the entirety
00:06:51.440 | of the infrastructure, you're interacting with the device.
00:06:53.880 | The feeling is, okay, sure, we anthropomorphize things,
00:06:58.080 | but that feeling is still there.
00:07:00.240 | So what do you think we, as humans,
00:07:03.640 | the purity of the interaction with a smart assistant,
00:07:06.680 | what do you think we look for in that interaction?
00:07:10.200 | - I think in the certain interactions,
00:07:12.240 | I think will be very much where it does feel like a human,
00:07:15.920 | because it has a persona of its own.
00:07:18.200 | And in certain ones, it wouldn't be.
00:07:20.680 | So I think a simple example to think of it
00:07:23.080 | is if you're walking through the house
00:07:25.200 | and you just want to turn on your lights on and off,
00:07:27.960 | and you're issuing a command,
00:07:29.840 | that's not very much like a human-like interaction.
00:07:32.040 | And that's where the AI shouldn't come back
00:07:33.840 | and have a conversation with you.
00:07:35.240 | Just, it should simply complete that command.
00:07:38.480 | So those, I think the blend of,
00:07:40.200 | we have to think about this as not human-human alone.
00:07:43.240 | It is a human-machine interaction,
00:07:45.080 | and certain aspects of humans are needed,
00:07:48.160 | and certain aspects and situations
00:07:49.920 | demand it to be like a machine.
00:07:51.640 | - So I told you, it's going to be philosophical in parts.
00:07:55.040 | What's the difference between human and machine
00:07:57.480 | in that interaction?
00:07:58.660 | When we interact, two humans,
00:08:00.760 | especially those are friends and loved ones,
00:08:04.000 | versus you and a machine that you also are close with.
00:08:09.000 | - I think you have to think about the roles the AI plays.
00:08:14.040 | And it differs from different customer to customer,
00:08:16.240 | different situation to situation.
00:08:17.980 | Especially I can speak from Alexa's perspective.
00:08:21.480 | It is a companion, a friend at times,
00:08:24.960 | an assistant, and an advisor down the line.
00:08:27.480 | So I think most AIs will have this kind of attributes,
00:08:31.200 | and it will be very situational in nature.
00:08:33.000 | So where is the boundary?
00:08:34.640 | I think the boundary depends on exact context
00:08:37.080 | in which you're interacting with the AI.
00:08:39.280 | - So the depth and the richness
00:08:41.200 | of natural language conversation has been,
00:08:44.480 | by Alan Turing, been used to try to define
00:08:48.120 | what it means to be intelligent.
00:08:50.080 | You know, there's a lot of criticism of that kind of test,
00:08:52.260 | but what do you think is a good test of intelligence,
00:08:55.800 | in your view, in the context of the Turing test?
00:08:58.360 | And Alexa, with the Alexa Prize,
00:09:01.800 | this whole realm, do you think about this
00:09:05.320 | human intelligence, what it means to define it,
00:09:08.200 | what it means to reach that level?
00:09:10.120 | - I do think the ability to converse
00:09:12.520 | is a sign of an ultimate intelligence.
00:09:15.200 | I think that there's no question about it.
00:09:17.480 | So if you think about all aspects of humans,
00:09:20.600 | there are sensors we have,
00:09:22.860 | and those are basically a data collection mechanism.
00:09:26.440 | And based on that, we make some decisions
00:09:28.240 | with our sensory brains, right?
00:09:30.600 | And from that perspective, I think there are elements
00:09:34.480 | we have to talk about how we sense the world,
00:09:37.120 | and then how we act based on what we sense.
00:09:40.360 | Those elements clearly machines have.
00:09:43.680 | But then there's the other aspects of computation
00:09:46.820 | that is way better.
00:09:48.400 | I also mentioned about memory, again,
00:09:50.080 | in terms of being near infinite,
00:09:51.920 | depending on the storage capacity you have.
00:09:54.240 | And the retrieval can be extremely fast and pure,
00:09:58.200 | in terms of like, there's no ambiguity
00:09:59.640 | of who did I see when, right?
00:10:02.080 | I mean, machines can remember that quite well.
00:10:04.480 | So again, on a philosophical level,
00:10:06.880 | I do subscribe to the fact that to be able to converse,
00:10:10.860 | and as part of that, to be able to reason
00:10:13.440 | based on the world knowledge you've acquired,
00:10:15.260 | and the sensory knowledge that is there,
00:10:18.340 | is definitely very much the essence of intelligence.
00:10:22.100 | But intelligence can go beyond human level intelligence
00:10:26.960 | based on what machines are getting capable of.
00:10:29.800 | - So what do you think, maybe stepping outside of Alexa,
00:10:33.440 | broadly as an AI field,
00:10:35.760 | what do you think is a good test of intelligence?
00:10:38.720 | Put it another way, outside of Alexa,
00:10:41.200 | because so much of Alexa is a product,
00:10:43.040 | is an experience for the customer.
00:10:44.920 | On the research side, what would impress the heck out of you
00:10:47.960 | if you saw, you know, what is the test where you said,
00:10:50.800 | wow, this thing is now starting to encroach
00:10:56.720 | into the realm of what we loosely think
00:10:59.040 | of as human intelligence?
00:11:00.360 | - So, well, we think of it as AGI
00:11:02.400 | and human intelligence all together, right?
00:11:04.360 | So in some sense, and I think we are quite far from that.
00:11:08.000 | I think an unbiased view I have
00:11:11.480 | is that Alexa's intelligence capability is a great test.
00:11:16.480 | I think of it as, there are many other proof points,
00:11:20.600 | like self-driving cars, game playing, like Go or chess.
00:11:26.300 | Let's take those two as an example.
00:11:28.660 | Clearly requires a lot of data-driven learning
00:11:31.780 | and intelligence, but it's not as hard a problem
00:11:35.100 | as conversing, as an AI, with humans
00:11:39.740 | to accomplish certain tasks or open domain chat,
00:11:42.340 | as you mentioned, Alexa Prize.
00:11:43.980 | In those settings, the key difference is that
00:11:48.180 | the end goal is not defined, unlike game playing.
00:11:51.900 | You also do not know exactly what state you are in
00:11:55.720 | in a particular goal completion scenario.
00:11:58.960 | In certain sense, sometimes you can, if it is a simple goal,
00:12:02.100 | but if you're, even certain examples like planning a weekend
00:12:05.620 | or you can imagine how many things change along the way.
00:12:09.900 | You look for weather, you may change your mind
00:12:11.940 | and you change the destination,
00:12:14.860 | or you want to catch a particular event,
00:12:17.060 | and then you decide, no, I want this other event
00:12:19.420 | I want to go to.
00:12:20.540 | So these dimensions of how many different steps
00:12:24.020 | are possible when you're conversing as a human
00:12:26.380 | with a machine makes it an extremely daunting problem.
00:12:29.140 | And I think it is the ultimate test for intelligence.
00:12:32.380 | - And don't you think that natural language
00:12:35.700 | is enough to prove that conversation,
00:12:39.020 | just pure conversation?
00:12:40.420 | - From a scientific standpoint,
00:12:42.340 | natural language is a great test, but I would go beyond,
00:12:46.500 | I don't want to limit it to as natural language
00:12:48.760 | as simply understanding an intent
00:12:51.100 | or parsing for entities and so forth.
00:12:52.780 | We are really talking about dialogue.
00:12:54.900 | - Dialogue.
00:12:55.740 | - So I would say human machine dialogue
00:12:58.500 | is definitely one of the best tests of intelligence.
00:13:02.980 | - So can you briefly speak to the Alexa Prize
00:13:06.680 | for people who are not familiar with it,
00:13:08.660 | and also just maybe where things stand
00:13:12.660 | and what have you learned and what's surprising?
00:13:15.420 | What have you seen that's surprising
00:13:16.900 | from this incredible competition?
00:13:18.460 | - Absolutely, it's a very exciting competition.
00:13:20.960 | Alexa Prize is essentially a grand challenge
00:13:24.040 | in conversational artificial intelligence,
00:13:26.880 | where we threw the gauntlet to the universities
00:13:29.420 | who do active research in the field to say,
00:13:32.380 | can you build what we call a social bot
00:13:35.320 | that can converse with you coherently
00:13:37.320 | and engagingly for 20 minutes?
00:13:39.800 | That is an extremely hard challenge
00:13:42.480 | talking to someone who you're meeting for the first time,
00:13:46.460 | or even if you've met them quite often,
00:13:49.640 | to speak at 20 minutes on any topic,
00:13:53.560 | an evolving nature of topics is super hard.
00:13:57.760 | We have completed two successful years of the competition.
00:14:01.600 | The first was won by the University of Washington,
00:14:03.400 | the second by the University of California.
00:14:05.560 | We are in our third instance.
00:14:06.880 | We have an extremely strong cohort of 10 teams,
00:14:09.640 | and the third instance of the Alexa Prize is underway now.
00:14:14.820 | And we are seeing a constant evolution.
00:14:17.480 | First year was definitely a learning.
00:14:18.920 | It was a lot of things to be put together.
00:14:21.160 | We had to build a lot of infrastructure
00:14:23.640 | to enable these universities
00:14:25.960 | to be able to build magical experiences
00:14:28.320 | and do high quality research.
00:14:31.560 | - Just a few quick questions, sorry for the interruption.
00:14:33.900 | What does failure look like in the 20 minute session?
00:14:37.260 | So what does it mean to fail
00:14:38.720 | not to reach the 20 minute mark?
00:14:39.960 | - Oh, awesome question.
00:14:41.240 | So there are one, first of all,
00:14:43.360 | I forgot to mention one more detail.
00:14:45.380 | It's not just 20 minutes,
00:14:46.560 | but the quality of the conversation too that matters.
00:14:49.320 | And the beauty of this competition,
00:14:51.480 | before I answer that question on what failure means,
00:14:53.800 | is first that you actually converse
00:14:56.600 | with millions and millions of customers
00:14:59.000 | as the social bots.
00:15:00.840 | So during the judging phases, there are multiple phases,
00:15:05.000 | before we get to the finals,
00:15:06.320 | which is a very controlled judging
00:15:07.960 | in a situation where we bring in judges
00:15:10.400 | and we have interactors who interact
00:15:12.480 | with these social bots.
00:15:14.400 | That is a much more controlled setting,
00:15:15.920 | but till the point we get to the finals,
00:15:18.960 | all the judging is essentially by the customers of Alexa.
00:15:22.720 | And there you basically rate on a simple question,
00:15:26.200 | how good your experience was.
00:15:28.480 | So that's where we are not testing
00:15:29.900 | for a 20 minute boundary being crossed,
00:15:32.800 | because you do want it to be very much
00:15:34.880 | like a clear cut winner be chosen
00:15:37.880 | and it's an absolute bar.
00:15:40.120 | So did you really break that 20 minute barrier
00:15:42.840 | is why we have to test it in a more controlled setting
00:15:45.960 | with actors, essentially interactors,
00:15:48.680 | and see how the conversation goes.
00:15:50.840 | So this is why it's a subtle difference
00:15:54.180 | between how it's being tested in the field
00:15:57.040 | with real customers versus in the lab to award the prize.
00:16:00.500 | So on the latter one, what it means is that
00:16:03.560 | essentially there are three judges
00:16:08.040 | and two of them have to say this conversation
00:16:10.400 | has stalled essentially.
00:16:11.780 | - Got it, and the judges are human experts.
00:16:15.840 | - Judges are human experts.
00:16:17.000 | - Okay, great.
00:16:17.840 | So this is in the third year.
00:16:19.140 | So what's been the evolution?
00:16:20.920 | How far, so the DARPA challenge, in the first year
00:16:24.640 | the autonomous vehicles,
00:16:25.760 | nobody finished; in the second year,
00:16:27.760 | a few more finished in the desert.
00:16:29.700 | So how far along in this,
00:16:33.280 | I would say much harder challenge are we?
00:16:36.360 | - This challenge has come a long way
00:16:37.720 | to the extent that we're definitely not close
00:16:40.440 | to the 20 minute barrier being crossed with coherent
00:16:42.720 | and engaging conversation.
00:16:44.720 | I think we are still five to 10 years away
00:16:46.840 | in that horizon to complete that.
00:16:48.640 | But the progress is immense.
00:16:51.360 | Like what you're finding is the accuracy
00:16:54.080 | and what kind of responses these social bots generate
00:16:57.360 | is getting better and better.
00:16:59.160 | What's even amazing to see is that now there's humor coming in.
00:17:03.320 | The bots are quite--
00:17:04.920 | - Awesome.
00:17:05.760 | (laughs)
00:17:06.600 | - You're talking about ultimate signs of intelligence.
00:17:09.440 | I think humor is a very high bar
00:17:11.840 | in terms of what it takes to create humor.
00:17:14.920 | And I don't mean just being goofy.
00:17:16.520 | I really mean good sense of humor
00:17:19.440 | is also a sign of intelligence in my mind
00:17:21.600 | and something very hard to do.
00:17:23.120 | So these social bots are now exploring
00:17:25.040 | not only what we think of natural language abilities,
00:17:28.560 | but also personality attributes
00:17:30.360 | and aspects of when to inject an appropriate joke,
00:17:35.520 | when you don't know the domain,
00:17:38.400 | how you come back with something more intelligible
00:17:41.360 | so that you can continue the conversation.
00:17:43.160 | If you and I are talking about AI
00:17:45.160 | and we are domain experts, we can speak to it.
00:17:47.480 | But if you suddenly switch a topic to that I don't know of,
00:17:50.480 | how do I change the conversation?
00:17:52.120 | So you're starting to notice these elements as well.
00:17:55.200 | And that's coming from partly by the nature
00:17:58.520 | of the 20 minute challenge
00:18:00.120 | that people are getting quite clever
00:18:02.520 | on how to really converse
00:18:05.600 | and essentially mask some of the understanding defects
00:18:08.600 | if they exist.
00:18:09.840 | - So some of this, this is not Alexa the product.
00:18:12.680 | This is somewhat for fun,
00:18:15.640 | for research, for innovation and so on.
00:18:17.800 | I have a question sort of in this modern era,
00:18:20.280 | there's a lot of, if you look at Twitter
00:18:23.440 | and Facebook and so on, there's discourse,
00:18:25.800 | public discourse going on.
00:18:27.200 | And some things are a little bit too edgy,
00:18:28.800 | people get blocked and so on.
00:18:30.640 | I'm just out of curiosity.
00:18:32.280 | Are people in this context pushing the limits?
00:18:35.960 | Is anyone using the F word?
00:18:37.720 | Is anyone sort of pushing back,
00:18:41.400 | sort of arguing, I guess I should say,
00:18:45.960 | as part of the dialogue to really draw people in?
00:18:48.280 | - First of all, let me just back up a bit
00:18:50.320 | in terms of why we are doing this, right?
00:18:52.120 | So you said it's fun.
00:18:54.280 | I think fun is more part of the engaging part for customers.
00:18:59.920 | It is one of the most used skills as well
00:19:02.480 | in our skill store.
00:19:04.360 | But that apart, the real goal was essentially
00:19:07.200 | what was happening is with a lot of AI research
00:19:10.400 | moving to industry, we felt that academia
00:19:13.560 | has the risk of not being able to have the same resources
00:19:16.800 | at disposal that we have, which is lots of data,
00:19:20.480 | massive computing power, and clear ways
00:19:24.640 | to test these AI advances with real customer benefits.
00:19:28.520 | So we brought all these three together in the Alexa Prize.
00:19:30.880 | That's why it's one of my favorite projects in Amazon.
00:19:33.880 | And with that, the secondary effect is,
00:19:37.520 | yes, it has become engaging for our customers as well.
00:19:40.960 | We're not there in terms of where we want it to be, right?
00:19:43.920 | But it's a huge progress.
00:19:45.080 | But coming back to your question on
00:19:47.120 | how do the conversations evolve?
00:19:48.840 | Yes, there is some natural attributes
00:19:51.040 | of what you said in terms of argument
00:19:52.800 | and some amount of swearing.
00:19:54.200 | The way we take care of that is that
00:19:56.800 | there is a sensitive filter we have built.
00:19:59.120 | - Certain keywords and so on. - It's more than keywords.
00:20:01.400 | A little more in terms of,
00:20:03.520 | of course, there's keyword-based too,
00:20:04.920 | but there's more in terms of,
00:20:06.960 | these words can be very contextual, as you can see,
00:20:09.480 | and also the topic can be something
00:20:12.640 | that you don't want a conversation to happen
00:20:15.440 | because this is a communal device as well.
00:20:17.320 | A lot of people use these devices.
00:20:19.280 | So we have put a lot of guardrails
00:20:21.800 | for the conversation to be more useful for advancing AI
00:20:25.960 | and not so much of these other issues you attributed,
00:20:30.960 | what's happening in the AI field as well.
00:20:32.920 | - Right, so this is actually a serious opportunity.
00:20:35.320 | I didn't use the right word, fun.
00:20:36.920 | I think it's an open opportunity to do
00:20:40.000 | some of the best innovation
00:20:42.040 | in conversational agents in the world.
00:20:44.800 | - Absolutely.
00:20:45.960 | - Why just universities?
00:20:49.040 | - Oh, why just universities?
00:20:49.920 | Because as I said, I really felt-
00:20:51.560 | - Young minds?
00:20:52.400 | - Young minds.
00:20:53.240 | It's also, if you think about the other aspect
00:20:57.960 | of where the whole industry is moving with AI,
00:21:01.440 | there's a dearth of talent given the demands.
00:21:04.920 | So you do want universities to have a clear place
00:21:09.920 | where they can invent and research and not fall behind
00:21:12.520 | such that they can't motivate students.
00:21:13.960 | Imagine all grad students left to industry like us
00:21:18.960 | or faculty members, which has happened too.
00:21:22.920 | So this is a way that if you're so passionate
00:21:25.200 | about the field where you feel industry
00:21:28.040 | and academia need to work well,
00:21:29.760 | this is a great example and a great way
00:21:32.880 | for universities to participate.
00:21:34.520 | - So what do you think it takes to build a system
00:21:37.320 | that wins the Alexa Prize?
00:21:39.640 | - I think you have to start focusing
00:21:42.960 | on aspects of reasoning that it is,
00:21:47.960 | there are still more lookups
00:21:50.800 | of what intents customers asking for
00:21:54.200 | and responding to those rather than really reasoning
00:21:58.960 | about the elements of the conversation.
00:22:02.520 | For instance, if you're playing,
00:22:06.280 | if the conversation is about games
00:22:08.160 | and it's about a recent sports event,
00:22:11.280 | there's so much context involved
00:22:13.360 | and you have to understand the entities
00:22:15.840 | that are being mentioned so that the conversation
00:22:19.120 | is coherent rather than you suddenly just switch
00:22:21.600 | to knowing some fact about a sports entity
00:22:25.240 | and you're just relaying that rather
00:22:26.720 | than understanding the true context of the game.
00:22:28.760 | Like if you just said, I learned this fun fact
00:22:32.360 | about Tom Brady rather than really say how he played
00:22:36.960 | the game the previous night,
00:22:39.360 | then the conversation is not really that intelligent.
00:22:42.880 | So you have to go to more reasoning elements
00:22:46.240 | of understanding the context of the dialogue
00:22:49.160 | and giving more appropriate responses,
00:22:51.280 | which tells you that we are still quite far
00:22:53.760 | because a lot of times it's more facts being looked up
00:22:57.440 | and something that's close enough as an answer,
00:22:59.960 | but not really the answer.
00:23:02.080 | So that is where the research needs to go more
00:23:05.080 | into actual true understanding and reasoning.
00:23:08.400 | And that's why I feel it's a great way to do it
00:23:10.480 | because you have an engaged set of users
00:23:13.520 | working to make, help these AI advances happen in this case.
00:23:18.240 | - You mentioned customers there quite a bit
00:23:20.680 | and there's a skill.
00:23:22.160 | What is the experience for the user that's helping?
00:23:26.520 | So just to clarify, this isn't, as far as I understand,
00:23:30.120 | the Alexa, so this skill is a standalone
00:23:32.560 | for the Alexa prize.
00:23:33.600 | I mean, it's focused on the Alexa prize.
00:23:35.360 | It's not you ordering certain things on Amazon.com
00:23:38.080 | or checking the weather or playing Spotify, right?
00:23:40.720 | This is a separate skill.
00:23:42.520 | And so you're focused on helping that.
00:23:45.600 | I don't know, how do people, how do customers think of it?
00:23:48.560 | Are they having fun?
00:23:49.840 | Are they helping teach the system?
00:23:52.080 | What's the experience like?
00:23:53.080 | - I think it's both actually.
00:23:54.680 | And let me tell you how you invoke this skill.
00:23:57.840 | So all you have to say, Alexa, let's chat.
00:24:00.240 | And then the first time you say, Alexa, let's chat,
00:24:03.360 | it comes back with a clear message
00:24:04.760 | that you're interacting with one of those
00:24:06.280 | university social bots.
00:24:08.040 | And there's a clear,
00:24:09.360 | so you know exactly how you interact, right?
00:24:11.840 | And that is why it's very transparent.
00:24:14.080 | You are being asked to help, right?
00:24:16.240 | And we have a lot of mechanisms where as the,
00:24:20.960 | we are in the first phase of feedback phase,
00:24:23.680 | then you send a lot of emails to our customers
00:24:26.720 | and then they know that the team needs a lot of interactions
00:24:31.720 | to improve the accuracy of the system.
00:24:33.920 | So we know we have a lot of customers
00:24:35.880 | who really want to help these university bots
00:24:38.920 | and they're conversing with that.
00:24:40.400 | And some are just having fun
00:24:41.960 | with just saying, Alexa, let's chat.
00:24:43.960 | And also some adversarial behavior to see whether,
00:24:47.280 | how much do you understand as a social bot?
00:24:50.240 | So I think we have a good healthy mix
00:24:52.240 | of all three situations.
00:24:53.880 | - So what is the,
00:24:55.280 | if we talk about solving the Alexa challenge,
00:24:58.000 | the Alexa prize,
00:24:59.040 | what's the data set of really engaging,
00:25:05.480 | pleasant conversations look like?
00:25:07.480 | 'Cause if we think of this as a supervised learning problem,
00:25:10.560 | I don't know if it has to be,
00:25:12.160 | but if it does, maybe you can comment on that.
00:25:15.360 | Do you think there needs to be a data set
00:25:17.440 | of what it means to be an engaging,
00:25:21.120 | successful, fulfilling conversation?
00:25:22.560 | - I think that's part of the research question here.
00:25:24.720 | This was, I think,
00:25:25.920 | we at least got the first spot right,
00:25:29.160 | which is have a way for universities to build
00:25:33.320 | and test in a real world setting.
00:25:35.760 | Now you're asking in terms of the next phase of questions,
00:25:38.560 | which we are also asking, by the way,
00:25:41.040 | what does success look like from an optimization function?
00:25:45.320 | That's what you're asking in terms of,
00:25:47.120 | we as researchers are used to having a great corpus
00:25:49.480 | of annotated data and then making,
00:25:52.560 | then sort of tune our algorithms on those, right?
00:25:57.520 | And fortunately and unfortunately,
00:26:00.560 | in this world of Alexa prize,
00:26:02.840 | that is not the way we are going after it.
00:26:05.320 | So you have to focus more on learning
00:26:07.680 | based on live feedback.
00:26:10.880 | That is another element that's unique where just now,
00:26:15.040 | I started with giving you how you ingress
00:26:17.240 | and experience this capability as a customer.
00:26:21.480 | What happens when you're done?
00:26:23.560 | So they ask you a simple question on a scale of one to five,
00:26:27.480 | how likely are you to interact with this social bot again?
00:26:31.800 | That is a good feedback
00:26:33.800 | and customers can also leave more open-ended feedback.
00:26:37.400 | And I think partly that to me
00:26:40.840 | is one part of the question you're asking,
00:26:42.600 | which I'm saying is a mental model shift
00:26:44.560 | that as researchers also, you have to change your mindset
00:26:48.520 | that this is not a DARPA evaluation or an NSF funded study
00:26:52.640 | and you have a nice corpus.
00:26:54.920 | This is where it's real world, you have real data.
00:26:58.680 | - The scale is amazing.
00:26:59.840 | That's a beautiful thing.
00:27:01.520 | And then the customer, the user can quit the conversation
00:27:05.720 | at any time.
00:27:06.560 | - Exactly, the user can.
00:27:07.400 | That is also a signal for how good you were at that point.
00:27:11.720 | - So, and then on a scale of one to five, one to three,
00:27:15.000 | do they say how likely are you, or is it just a binary?
00:27:17.680 | - One to five.
00:27:18.760 | - One to five.
00:27:20.000 | Wow, okay.
00:27:20.840 | That's such a beautifully constructed challenge, okay.
00:27:23.480 | You said the only way to make a smart assistant really smart
00:27:30.000 | is to give it eyes and let it explore the world.
00:27:32.480 | I'm not sure it might've been taken out of context,
00:27:36.840 | but can you comment on that?
00:27:38.240 | Can you elaborate on that idea?
00:27:40.080 | 'Cause I personally also find that idea super exciting
00:27:43.120 | from a social robotics, personal robotics perspective.
00:27:46.240 | - Yeah, a lot of things do get taken out of context.
00:27:48.840 | This particular one was just a philosophical discussion
00:27:52.040 | we were having in terms of what does intelligence look like.
00:27:55.520 | And the context was in terms of learning,
00:27:59.200 | I think just we said, we as humans are empowered
00:28:03.040 | with many different sensory abilities.
00:28:05.160 | I do believe that eyes are an important aspect of it
00:28:09.560 | in terms of, if you think about how we as humans learn,
00:28:13.680 | it is quite complex, and it's also not unimodal
00:28:18.320 | that you are fed a ton of text or audio,
00:28:22.040 | and you just learn that way.
00:28:23.360 | No, you learn by experience, you learn by seeing,
00:28:27.240 | you're taught by humans,
00:28:30.320 | and we are very efficient in how we learn.
00:28:33.240 | Machines on the contrary are very inefficient
00:28:35.320 | on how they learn, especially these AIs.
00:28:37.640 | I think the next wave of research
00:28:40.800 | is going to be with less data,
00:28:44.360 | not just less human, not just with less labeled data,
00:28:48.240 | but also with a lot of weak supervision,
00:28:51.080 | and where you can increase the learning rate.
00:28:55.160 | I don't mean less data in terms of not having
00:28:57.280 | a lot of data to learn from,
00:28:58.680 | that we are generating so much data,
00:29:00.360 | but it is more about from an aspect
00:29:02.640 | of how fast can you learn.
00:29:04.920 | - So improving the quality of the data
00:29:07.080 | and the learning process.
00:29:09.960 | - I think more on the learning process.
00:29:11.480 | I think we have to, we as humans learn
00:29:13.600 | with a lot of noisy data, right?
00:29:15.720 | And I think that's the part
00:29:18.520 | that I don't think should change.
00:29:21.480 | What should change is how we learn, right?
00:29:23.920 | So if you look at, you mentioned supervised learning,
00:29:26.120 | we are making transformative shifts,
00:29:28.000 | moving to more unsupervised, more weak supervision.
00:29:31.160 | Those are the key aspects of how to learn.
00:29:34.880 | And I think in that setting, I hope you agree with me
00:29:37.800 | that having other senses is very crucial
00:29:41.720 | in terms of how you learn.
00:29:43.520 | - So absolutely, and from a machine learning perspective,
00:29:46.720 | which I hope we get a chance to talk to a few aspects
00:29:49.720 | that are fascinating there,
00:29:51.120 | but to stick on the point of sort of
00:29:54.000 | a body, an embodiment.
00:29:56.280 | So Alexa has a body,
00:29:57.560 | has a very minimalistic, beautiful interface,
00:30:01.640 | or there's a ring and so on.
00:30:02.880 | I mean, I'm not sure of all the flavors
00:30:04.520 | of the devices that Alexa lives on,
00:30:07.600 | but there's a minimalistic, basic interface.
00:30:11.020 | And nevertheless, we humans, so I have a Roomba,
00:30:15.720 | I have all kinds of robots all over everywhere.
00:30:18.280 | So what do you think the Alexa of the future looks like
00:30:23.280 | if it begins to shift what his body looks like?
00:30:29.280 | Maybe beyond Alexa,
00:30:30.680 | what do you think of the different devices in the home
00:30:33.800 | as they start to embody their intelligence more and more?
00:30:36.880 | What do you think that looks like?
00:30:38.120 | Philosophically, a future, what do you think that looks like?
00:30:41.200 | - I think, let's look at what's happening today.
00:30:43.600 | You mentioned, I think, other devices as in Amazon devices,
00:30:46.840 | but I also wanted to point out,
00:30:48.040 | Alexa is already integrated in a lot of third-party devices,
00:30:51.360 | which also come in lots of forms and shapes,
00:30:54.840 | some in robots, right, some in microwaves,
00:30:58.960 | some in appliances that you use in everyday life.
00:31:02.600 | So I think it's not just the shape Alexa takes
00:31:07.600 | in terms of form factors,
00:31:09.160 | but it's also where all it's available.
00:31:13.000 | It's getting in cars,
00:31:14.240 | it's getting in different appliances in homes,
00:31:16.740 | even toothbrushes, right?
00:31:18.720 | So I think you have to think about it
00:31:20.760 | as not a physical assistant.
00:31:25.440 | It will be in some embodiment, as you said,
00:31:28.480 | we already have these nice devices,
00:31:31.120 | but I think it's also important to think of it,
00:31:33.800 | it is a virtual assistant.
00:31:35.640 | It is superhuman in the sense
00:31:37.200 | that it is in multiple places at the same time.
00:31:40.280 | So I think the actual embodiment in some sense,
00:31:45.200 | to me, doesn't matter.
00:31:46.700 | I think you have to think of it as not as human-like
00:31:52.620 | and more of what its capabilities are
00:31:56.100 | that derive a lot of benefit for customers
00:31:58.820 | and how there are different ways to delight customers
00:32:02.060 | and different experiences.
00:32:03.980 | And I think I'm a big fan of it not being just human-like,
00:32:08.980 | it should be human-like in certain situations,
00:32:11.140 | Alexa Prize social bot in terms of conversation
00:32:13.380 | is a great way to look at it,
00:32:14.900 | but there are other scenarios where human-like,
00:32:18.820 | I think is underselling the abilities of this AI.
00:32:22.080 | - So if I could trivialize what we're talking about.
00:32:26.140 | So if you look at the way Steve Jobs thought
00:32:29.420 | about the interaction with the device that Apple produced,
00:32:33.440 | there was a extreme focus on controlling the experience
00:32:36.780 | by making sure there's only this Apple produced devices.
00:32:40.200 | You see the voice of Alexa
00:32:43.420 | being taking all kinds of forms
00:32:45.620 | depending on what the customers want.
00:32:47.100 | And that means it could be anywhere
00:32:49.900 | from the microwave to a vacuum cleaner,
00:32:52.660 | to the home and so on.
00:32:54.260 | The voice is the essential element of the interaction.
00:32:57.740 | - I think voice is an essence.
00:32:59.780 | It's not all, but it's a key aspect.
00:33:02.180 | I think to your question in terms of
00:33:05.620 | you should be able to recognize Alexa.
00:33:08.180 | And that's a huge problem.
00:33:09.920 | I think in terms of a huge scientific problem,
00:33:11.980 | I should say like, what are the traits?
00:33:13.700 | What makes it look like Alexa,
00:33:16.100 | especially in different settings.
00:33:17.540 | And especially if it's primarily voice what it is,
00:33:20.380 | but Alexa is not just voice either, right?
00:33:22.220 | I mean, we have devices with a screen.
00:33:25.020 | Now you're seeing just other behaviors of Alexa.
00:33:28.500 | So I think we are in very early stages of what that means.
00:33:31.380 | And this will be an important topic for the following years.
00:33:34.780 | But I do believe that being able to recognize
00:33:38.220 | and tell when it's Alexa versus it's not
00:33:40.500 | is going to be important from an Alexa perspective.
00:33:43.380 | I'm not speaking for the entire AI community,
00:33:46.020 | but I think attribution.
00:33:49.460 | And as we go into more of understanding who did what,
00:33:54.460 | that identity of the AI is crucial in the coming world.
00:33:58.780 | - I think from the broad AI community perspective,
00:34:01.100 | that's also a fascinating problem.
00:34:02.900 | So basically if I close my eyes and listen to the voice,
00:34:06.220 | what would it take for me to recognize that this is Alexa?
00:34:08.780 | - Exactly.
00:34:09.620 | - Or at least the Alexa that I've come to know
00:34:11.420 | from my personal experience in my home
00:34:13.820 | through my interactions, that kind of thing.
00:34:15.140 | - Yeah, and the Alexa here in the US is very different
00:34:17.580 | than Alexa in UK and the Alexa in India,
00:34:20.180 | even though they are all speaking English
00:34:22.220 | or the Australian version.
00:34:23.980 | So again, so now think about when you go
00:34:27.340 | into a different culture, a different community,
00:34:29.060 | but you traveled there, what do you recognize Alexa?
00:34:32.460 | I think these are super hard questions actually.
00:34:34.820 | - So there's a team that works on personality.
00:34:37.460 | So if we talk about those different flavors
00:34:40.060 | of what it means culturally speaking, India, UK, US,
00:34:44.020 | what does it mean to add?
00:34:45.580 | So the problem that we just stated, which is fascinating,
00:34:48.460 | how do we make it purely recognizable that it's Alexa?
00:34:52.680 | Assuming that the qualities of the voice are not sufficient,
00:34:57.300 | it's also the content of what is being said.
00:35:01.620 | How do we do that?
00:35:02.740 | How does the personality come into play?
00:35:04.900 | What's that research look like?
00:35:07.420 | I mean, it's such a fascinating subject.
00:35:08.260 | - We have some very fascinating folks
00:35:11.620 | who from both the UX background and human factors
00:35:14.140 | are looking at these aspects and these exact questions.
00:35:17.500 | But I will definitely say it's not just how it sounds,
00:35:21.660 | the choice of words, the tone,
00:35:24.460 | not just, I mean, the voice identity of it,
00:35:26.860 | but the tone matters, the speed matters,
00:35:30.100 | how you speak, how you enunciate words,
00:35:34.020 | what choice of words are you using,
00:35:36.220 | how terse you are or how lengthy your explanations are;
00:35:40.780 | all of these are factors.
00:35:42.980 | And you also, you mentioned something crucial
00:35:45.500 | that you may have personalized it, Alexa,
00:35:49.180 | to some extent in your homes
00:35:51.420 | or in the devices you are interacting with.
00:35:53.460 | So you as your individual, how you prefer Alexa sounds
00:35:58.460 | can be different than how I prefer.
00:36:01.260 | And we may, and the amount of customizability
00:36:03.780 | you want to give is also a key debate we always have.
00:36:07.620 | But I do want to point out,
00:36:08.980 | it's more than the voice actor that recorded
00:36:11.500 | and it sounds like that actor.
00:36:13.980 | It is more about the choices of words,
00:36:16.900 | the attributes of tonality, the volume
00:36:19.740 | in terms of how you raise your pitch and so forth.
00:36:22.540 | All of that matters.
00:36:23.820 | - This is such a fascinating problem
00:36:25.420 | from a product perspective.
00:36:27.580 | I could see those debates just happening
00:36:29.460 | inside of the Alexa team of how much personalization
00:36:32.420 | do you do for the specific customer?
00:36:34.380 | 'Cause you're taking a risk if you over personalize
00:36:37.260 | because you don't, if you create a personality
00:36:42.020 | for a million people, you can test that better.
00:36:46.020 | You can create a rich, fulfilling experience
00:36:48.620 | that will do well.
00:36:50.060 | But if the more you personalize it,
00:36:52.260 | the less you can test it,
00:36:53.500 | the less you can know that it's a great experience.
00:36:56.340 | So how much personalization, what's the right balance?
00:36:59.700 | - I think the right balance depends on the customer.
00:37:01.580 | Give them the control.
00:37:02.780 | So I'll say, I think the more control you give customers,
00:37:07.420 | the better it is for everyone.
00:37:09.580 | And I'll give you some key personalization features.
00:37:13.860 | I think we have a feature called Remember This,
00:37:15.860 | which is where you can tell Alexa to remember something.
00:37:19.460 | There you have an explicit sort of control
00:37:23.060 | in customer's hand because they have to say,
00:37:24.580 | Alexa, remember X, Y, Z.
00:37:26.500 | - What kind of things would that be used for?
00:37:27.980 | For a song title or something?
00:37:30.380 | - I have stored my tire specs for my car
00:37:33.260 | because it's so hard to go and find and see what it is
00:37:36.740 | when you're having some issues.
00:37:39.060 | I store my mileage plan numbers
00:37:41.420 | for all the frequent flyer ones
00:37:43.100 | where I'm sometimes just looking at it and it's not handy.
00:37:45.940 | So those are my own personal choices I've made
00:37:49.940 | for Alexa to remember something on my behalf.
00:37:52.300 | So again, I think the choice was be explicit
00:37:56.020 | about how you provide that to a customer as a control.
00:38:00.020 | So I think these are the aspects of what you do.
00:38:03.500 | Like think about where we can use
00:38:06.380 | speaker recognition capabilities, where,
00:38:08.660 | if you taught Alexa that you are Lex
00:38:12.980 | and this person in your household is person two,
00:38:16.340 | then you can personalize the experiences.
00:38:17.940 | Again, these are very,
00:38:19.140 | in the CX customer experience patterns
00:38:22.860 | are very clear about and transparent
00:38:26.540 | when a personalization action is happening.
00:38:30.020 | And then you have other ways like you go
00:38:32.220 | through explicit control right now through your app
00:38:34.620 | that your multiple service providers, let's say for music,
00:38:38.220 | which one is your preferred one?
00:38:39.460 | So when you say play Sting,
00:38:41.300 | depending on whether you have preferred Spotify
00:38:43.820 | or Amazon Music or Apple Music,
00:38:45.700 | the decision is made where to play it from.
00:38:48.300 | - So what's Alexa's backstory from her perspective?
00:38:52.380 | I remember just asking as probably a lot of us
00:38:58.460 | are just the basic questions about love and so on of Alexa,
00:39:02.420 | just to see what the answer would be.
00:39:03.820 | Just, it feels like there's a little bit of a back,
00:39:07.740 | like there's a,
00:39:08.580 | feels like there's a little bit of personality,
00:39:10.300 | but not too much.
00:39:12.860 | Does Alexa have a metaphysical presence
00:39:17.860 | in this human universe we live in?
00:39:21.900 | Or is it something more ambiguous?
00:39:23.740 | Is there a past?
00:39:25.100 | Is there a birth?
00:39:26.240 | Is there a family kind of
00:39:28.560 | idea, even for joking purposes and so on?
00:39:31.200 | - I think, well, it does tell you if I think you,
00:39:34.880 | I should double check this,
00:39:35.800 | but if you said, when were you born?
00:39:37.200 | I think we do respond.
00:39:39.040 | I need to double check that,
00:39:40.160 | but I'm pretty positive about it.
00:39:41.520 | - I think you do, 'cause I think I've tested that.
00:39:44.040 | But that's like how,
00:39:46.800 | like I was born in Urbana-Champaign
00:39:49.160 | and whatever the year kind of thing, yeah.
00:39:51.280 | - So on terms of the metaphysical, I think it's early.
00:39:55.760 | Does it have the historic knowledge about herself
00:40:00.400 | to be able to do that?
00:40:01.480 | Maybe.
00:40:02.320 | Have we crossed that boundary?
00:40:03.760 | Not yet, right?
00:40:04.600 | In terms of being, thank you.
00:40:06.560 | Have we thought about it?
00:40:07.600 | Quite a bit, but I wouldn't say
00:40:09.580 | that we have come to a clear decision
00:40:11.540 | in terms of what it should look like.
00:40:13.040 | But you can imagine though,
00:40:15.800 | and I bring this back to the Alexa Prize Social Bot one,
00:40:19.240 | there you will start seeing some of that.
00:40:21.220 | Like these bots have their identity.
00:40:23.480 | And in terms of that, you may find,
00:40:25.800 | this is such a great research topic
00:40:28.480 | that some academia team may think of these problems
00:40:32.200 | and start solving them too.
00:40:34.160 | - So let me ask a question.
00:40:38.920 | It's kind of difficult, I think,
00:40:41.240 | but it feels fascinating to me
00:40:43.360 | 'cause I'm fascinated with psychology.
00:40:45.400 | It feels that the more personality you have,
00:40:48.280 | the more dangerous it is
00:40:50.480 | in terms of a customer perspective, a product.
00:40:54.480 | If you want to create a product that's useful.
00:40:57.120 | By dangerous, I mean creating an experience that upsets me.
00:41:01.380 | And so how do you get that right?
00:41:06.680 | Because if you look at the relationships,
00:41:10.080 | maybe I'm just a screwed up Russian,
00:41:11.800 | but if you look at the human to human relationship,
00:41:15.040 | some of our deepest relationships have fights,
00:41:18.160 | have tension, have the push and pull,
00:41:21.240 | have a little flavor in them.
00:41:22.820 | Do you want to have such flavor
00:41:26.200 | in an interaction with Alexa?
00:41:28.120 | How do you think about that?
00:41:29.480 | - So there's one other common thing that you didn't say,
00:41:32.480 | but we think of it as paramount for any deep relationship.
00:41:36.240 | That's trust.
00:41:37.800 | - Trust, yeah.
00:41:38.640 | - So I think if you trust every attribute you said,
00:41:42.160 | a fight, some tension, is all healthy.
00:41:46.040 | But what is sort of unnegotiable in this instance is trust.
00:41:51.040 | And I think the bar to earn customer trust for AI
00:41:54.440 | is very high, in some sense, more than a human.
00:41:58.000 | It's not just about personal information or your data.
00:42:03.000 | It's also about your actions on a daily basis.
00:42:06.600 | How trustworthy are you in terms of consistency,
00:42:09.400 | in terms of how accurate are you in understanding me?
00:42:12.640 | Like if you're talking to a person on the phone,
00:42:15.160 | if you have a problem with your,
00:42:16.360 | let's say your internet or something,
00:42:17.760 | if the person's not understanding,
00:42:19.160 | you lose trust right away.
00:42:20.520 | You don't want to talk to that person.
00:42:22.560 | That whole example gets amplified by a factor of 10,
00:42:25.920 | because when you're a human interacting with an AI,
00:42:29.760 | you have a certain expectation.
00:42:31.240 | Either you expect it to be very intelligent,
00:42:33.560 | and then you get upset, why is it behaving this way?
00:42:35.920 | Or you expect it to be not so intelligent,
00:42:39.080 | and when it surprises you, you're like,
00:42:40.360 | really, you're trying to be too smart?
00:42:42.480 | So I think we grapple with these hard questions as well,
00:42:45.240 | but I think the key is actions need to be trustworthy
00:42:49.120 | from these AIs, not just about data protection,
00:42:52.160 | your personal information protection,
00:42:54.720 | but also from how accurately it accomplishes
00:42:58.560 | all commands or all interactions.
00:43:01.040 | - Well, it's tough to hear because trust,
00:43:03.560 | you're absolutely right,
00:43:04.440 | but trust is such a high bar with AI systems,
00:43:06.880 | because people, and I see this,
00:43:08.760 | 'cause I work with autonomous vehicles,
00:43:10.920 | the bar that's placed on AI systems is unreasonably high.
00:43:14.800 | - Yeah, that is going to be, I agree with you,
00:43:17.400 | and I think of it as--
00:43:19.360 | - A challenge.
00:43:20.480 | - It's a challenge, and it also keeps my job.
00:43:23.000 | (laughing)
00:43:24.880 | So from that perspective, I totally,
00:43:27.520 | I think of it at both sides,
00:43:29.040 | as a customer and as a researcher.
00:43:31.320 | I think as a researcher, yes,
00:43:33.360 | occasionally it will frustrate me that,
00:43:35.000 | why is the bar so high for these AIs?
00:43:38.080 | And as a customer, then I say,
00:43:39.800 | absolutely it has to be that high, right?
00:43:42.080 | So I think that's the trade-off we have to balance,
00:43:45.240 | but doesn't change the fundamentals
00:43:47.760 | that trust has to be earned.
00:43:49.560 | And the question then becomes is,
00:43:52.080 | are we holding the AIs to a different bar
00:43:54.200 | in accuracy and mistakes than we hold humans?
00:43:57.000 | That's going to be a great societal question
00:43:58.960 | for years to come, I think, for us.
00:44:01.080 | - Well, one of the questions that we grapple with
00:44:02.960 | as a society now, that I think about a lot,
00:44:06.200 | I think a lot of people in AI think about a lot,
00:44:08.560 | and Alexa is taking on head-on, is privacy.
00:44:12.360 | The reality is, us giving over data
00:44:17.360 | to any AI system can be used to enrich our lives
00:44:23.000 | in profound ways.
00:44:25.800 | So if basically any product that does anything awesome
00:44:28.560 | for you, the more data it has,
00:44:31.720 | the more awesome things it can do.
00:44:34.080 | And yet, on the other side,
00:44:37.080 | people imagine the worst case possible scenario
00:44:39.440 | of what can you possibly do with that data.
00:44:42.240 | People, it boils down to trust, as you said before.
00:44:45.680 | There's a fundamental distrust
00:44:47.240 | of in certain groups of governments and so on,
00:44:50.440 | depending on the government, depending on who's in power,
00:44:52.840 | depending on all these kinds of factors.
00:44:55.400 | And so here's Alexa in the middle of all of it,
00:44:57.960 | in the home, trying to do good things for the customers.
00:45:02.320 | So how do you think about privacy in this context
00:45:05.000 | of smart assistance in the home?
00:45:06.680 | How do you maintain, how do you earn trust?
00:45:08.640 | - Absolutely, so as you said, trust is the key here.
00:45:12.400 | So you start with trust,
00:45:13.520 | and then privacy is a key aspect of it.
00:45:16.720 | It has to be designed in from the very beginning.
00:45:20.200 | And we believe in two fundamental principles.
00:45:23.880 | One is transparency, and second is control.
00:45:26.840 | So by transparency, I mean,
00:45:28.880 | when we build what is now called smart speaker
00:45:32.080 | or the first echo,
00:45:33.320 | we were quite judicious about making these right trade-offs
00:45:38.360 | on customers' behalf,
00:45:40.120 | that it is pretty clear
00:45:41.880 | when the audio is being sent to cloud.
00:45:44.160 | The light ring comes on
00:45:45.240 | when it has heard you say the wake word,
00:45:48.240 | and then the streaming happens, right?
00:45:49.720 | So when the light ring comes up,
00:45:51.320 | we also had, we put a physical mute button on it,
00:45:55.480 | just so if you didn't want it to be listening,
00:45:57.880 | even for the wake word,
00:45:58.720 | then you turn the mute button on,
00:46:01.760 | and that disables the microphones.
00:46:04.920 | That's just the first decision
00:46:06.600 | on essentially transparency and control.
00:46:09.720 | Oh, then even when we launched,
00:46:11.720 | we gave the control in the hands of the customers
00:46:13.800 | that you can go and look at
00:46:14.880 | any of your individual utterances that is recorded
00:46:17.760 | and delete them anytime.
00:46:19.560 | And we have kept true to that promise, right?
00:46:22.520 | So, and that is super, again,
00:46:24.960 | a great instance of showing how you have the control.
00:46:29.080 | Then we made it even easier.
00:46:30.440 | You can say Alexa, delete what I said today.
00:46:33.080 | So that is now making it even just more control
00:46:36.880 | in your hands with what's most convenient
00:46:39.360 | about this technology is voice.
00:46:42.000 | You delete it with your voice now.
00:46:44.400 | So these are the types of decisions we continually make.
00:46:48.040 | We just recently launched this feature called,
00:46:51.200 | what we think of it as if you wanted humans
00:46:53.680 | not to review your data,
00:46:55.720 | because you've mentioned supervised learning, right?
00:47:01.120 | So in supervised learning,
00:47:01.120 | humans have to give some annotation.
00:47:03.760 | And that also is now a feature where you can,
00:47:07.080 | essentially, if you've selected that flag,
00:47:09.280 | your data will not be reviewed by a human.
00:47:11.280 | So these are the types of controls
00:47:13.600 | that we have to constantly offer to customers.
00:47:17.440 | - So why do you think it bothers people so much that,
00:47:22.840 | so everything you just said is really powerful.
00:47:26.840 | So the control, the ability to delete,
00:47:28.360 | 'cause we collect, we have studies here running at MIT
00:47:31.080 | that collects huge amounts of data
00:47:32.720 | and people consent and so on.
00:47:34.820 | The ability to delete that data is really empowering
00:47:38.000 | and almost nobody ever asked to delete it,
00:47:39.980 | but the ability to have that control is really powerful.
00:47:44.160 | But still, there's these popular anecdotes,
00:47:47.040 | anecdotal evidence that people say,
00:47:49.280 | they like to tell that them and a friend
00:47:51.500 | were talking about something, I don't know,
00:47:53.700 | sweaters for cats.
00:47:56.080 | And all of a sudden, they'll have advertisements
00:47:58.160 | for cat sweaters on Amazon.
00:48:01.360 | That's a popular anecdote,
00:48:02.640 | as if something is always listening.
00:48:04.480 | Can you explain that anecdote,
00:48:07.760 | that experience that people have?
00:48:09.080 | What's the psychology of that?
00:48:10.920 | What's that experience?
00:48:13.000 | And can you, you've answered it,
00:48:15.040 | but let me just ask, is Alexa listening?
00:48:18.240 | - No, Alexa listens only for the wake word
00:48:21.320 | on the device, right?
00:48:22.520 | - And the wake word is?
00:48:23.880 | - The words like Alexa, Amazon, Echo,
00:48:28.040 | but you only choose one at a time.
00:48:29.640 | So you choose one and it listens only for that
00:48:31.960 | on our devices.
00:48:33.000 | So that's first.
00:48:35.160 | From a listening perspective,
00:48:36.480 | we have to be very clear that it's just the wake word.
00:48:38.360 | So you said, why is there this anxiety, if you may?
00:48:41.280 | - Yeah, exactly.
00:48:42.120 | - It's because there's a lot of confusion,
00:48:43.560 | what it really listens to, right?
00:48:45.320 | And I think it's partly on us to keep educating
00:48:48.760 | our customers and the general media more
00:48:52.240 | in terms of like how, what really happens
00:48:54.080 | and we've done a lot of it.
00:48:56.600 | And our pages on information are clear,
00:49:00.800 | but still people have to have more,
00:49:04.000 | there's always a hunger for information and clarity.
00:49:06.640 | And we'll constantly look at how best to communicate.
00:49:09.080 | If you go back and read everything,
00:49:10.520 | yes, it states exactly that.
00:49:12.240 | And then people could still question it.
00:49:15.320 | And I think that's absolutely okay to question.
00:49:17.720 | What we have to make sure is that we are,
00:49:21.720 | because our fundamental philosophy is customer first,
00:49:24.840 | customer obsession is our leadership principle.
00:49:27.240 | As a researcher,
00:49:30.040 | I put myself in the shoes of the customer
00:49:33.160 | and all decisions in Amazon are made with that.
00:49:35.840 | And trust has to be earned
00:49:38.000 | and we have to keep earning the trust of our customers
00:49:40.200 | in this setting.
00:49:41.400 | And to your other point on like,
00:49:44.040 | is there something showing up based on your conversations?
00:49:46.640 | No, I think the answer is like you,
00:49:49.600 | a lot of times when those experiences happen,
00:49:51.360 | you have to also know that, okay, it may be a winter season.
00:49:54.560 | People are looking for sweaters, right?
00:49:56.480 | And it shows up on your amazon.com because it is popular.
00:49:59.680 | So there are many of these,
00:50:01.480 | you mentioned that personality or personalization,
00:50:06.360 | turns out we are not that unique either, right?
00:50:09.160 | So those things we as humans start thinking,
00:50:12.120 | oh, must be because something was heard
00:50:14.160 | and that's why this other thing showed up.
00:50:16.760 | The answer is no,
00:50:17.800 | probably it is just the season for sweaters.
00:50:21.560 | - I'm not gonna ask you this question
00:50:23.840 | 'cause it's just 'cause you're also,
00:50:25.880 | 'cause people have so much paranoia.
00:50:27.200 | But for my, let me just say from my perspective,
00:50:29.240 | I hope there's a day when customer can ask Alexa
00:50:33.200 | to listen all the time, to improve the experience,
00:50:36.680 | to improve, because I personally don't see the negative
00:50:39.840 | because if you have the control and if you have the trust,
00:50:43.960 | there's no reason why you shouldn't be listening
00:50:45.680 | all the time to the conversations to learn more about you.
00:50:48.340 | Because ultimately, as long as you have control and trust,
00:50:53.860 | every data you provide to the device,
00:50:56.940 | that the device wants, is going to be useful.
00:51:01.460 | And so to me, as a machine learning person,
00:51:05.140 | I think it worries me how sensitive people are
00:51:09.540 | about their data relative to how empowering it could be
00:51:14.540 | for the devices around them,
00:51:21.180 | how enriching it could be for their own life
00:51:23.740 | to improve the product.
00:51:25.460 | So I just, it's something I think about sort of a lot,
00:51:28.340 | how do we make that devices,
00:51:29.580 | obviously Alexa thinks about it a lot as well.
00:51:32.260 | I don't know if you wanna comment on that.
00:51:34.260 | So have you seen, let me ask it in the form of a question.
00:51:37.180 | Have you seen an evolution in the way people think about
00:51:42.260 | their private data in the previous several years?
00:51:46.420 | So as we as a society get more and more comfortable
00:51:48.740 | with the data, how do we get more and more comfortable
00:51:51.540 | with the benefits we get by sharing more data?
00:51:55.300 | - First, let me answer that part.
00:51:57.780 | And then I'll wanna go back to the other aspect
00:51:59.620 | you were mentioning.
00:52:01.220 | So as a society, on a general,
00:52:03.860 | we are getting more comfortable as a society.
00:52:05.700 | It doesn't mean that everyone is.
00:52:08.660 | And I think we have to respect that.
00:52:10.260 | I don't think one size fits all
00:52:12.940 | is always gonna be the answer for all, right?
00:52:16.340 | By definition.
00:52:17.180 | Going back to your point on what more magical experiences
00:52:22.180 | can be launched in these kinds of AI settings.
00:52:26.060 | I think again, if you give the control,
00:52:28.620 | it's possible certain parts of it.
00:52:32.060 | So we have a feature called follow-up mode
00:52:33.940 | where if you turn it on and Alexa,
00:52:38.300 | after you've spoken to it, will open the mics again,
00:52:42.020 | thinking you will answer something again.
00:52:44.660 | Like if you're adding items to your shopping list
00:52:48.540 | or to-do list, you're not done.
00:52:51.420 | You want to keep going.
00:52:52.260 | So in that setting, it's awesome
00:52:53.620 | that it opens the mic for you to say,
00:52:55.580 | eggs and milk and then bread, right?
00:52:57.140 | So these are the kinds of things which you can empower.
00:52:59.900 | So, and then another feature we have,
00:53:02.300 | which is called Alexa Guard.
00:53:04.980 | I said, it only listens for the wake word, all right?
00:53:07.780 | But if you have, let's say you're going to say,
00:53:11.220 | you leave your home and you want Alexa to listen
00:53:13.460 | for a couple of sound events,
00:53:15.020 | like smoke alarm going off
00:53:17.220 | or someone breaking your glass, right?
00:53:19.300 | So it's like just to keep your peace of mind.
00:53:22.180 | So you can say Alexa on guard or I'm away
00:53:26.500 | and then it can be listening for these sound events.
00:53:29.220 | And when you're home, you come out of that mode, right?
00:53:33.020 | So this is another one where you again gave controls
00:53:35.540 | in the hands of the user or the customer
00:53:38.060 | and to enable some experience that is high utility
00:53:42.460 | and maybe even more delightful in the certain settings
00:53:44.620 | like follow up mode and so forth.
00:53:46.500 | Again, this general principle is the same,
00:53:48.900 | control in the hands of the customer.
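As an illustration of the Guard idea (a sketch, not the actual implementation), a sound-event detector only runs while the customer has put the device in away mode, and only a couple of event classes can ever trigger a notification. The classifier, event labels, and threshold below are invented stand-ins.

```python
# Illustrative sketch of the Guard idea: only while the user has said "I'm
# away" does the device run a sound-event classifier over short audio
# windows, and only a couple of event types (glass break, smoke alarm) can
# ever trigger a notification. The classifier here is a stand-in.
import numpy as np

GUARD_EVENTS = ("glass_break", "smoke_alarm")
NOTIFY_THRESHOLD = 0.9

def classify_sound_window(window: np.ndarray) -> dict:
    """Stand-in for a trained acoustic event classifier.
    Returns a probability per event type for a ~1s audio window."""
    rng = np.random.default_rng(int(abs(window.sum()) * 1000) % (2**32))
    probs = rng.dirichlet(np.ones(len(GUARD_EVENTS) + 1))
    return dict(zip(GUARD_EVENTS + ("other",), probs))

def guard_loop(audio_windows, away_mode: bool):
    """Yield (event, probability) only when Guard is on and confident."""
    if not away_mode:
        return                      # normal mode: listen for the wake word only
    for window in audio_windows:
        probs = classify_sound_window(window)
        for event in GUARD_EVENTS:
            if probs[event] >= NOTIFY_THRESHOLD:
                yield event, probs[event]   # e.g. push a notification to the app

if __name__ == "__main__":
    windows = [np.random.default_rng(i).normal(size=16000) for i in range(5)]
    print(list(guard_loop(windows, away_mode=True)))
```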
00:53:50.780 | - So I know we kind of started with a lot of philosophy
00:53:55.500 | and a lot of interesting topics
00:53:56.860 | and we're just jumping all over the place,
00:53:58.300 | but really some of the fascinating things
00:54:00.300 | that the Alexa team and Amazon is doing
00:54:03.020 | is in the algorithm side, the data side, the technology,
00:54:06.180 | the deep learning, machine learning and so on.
00:54:08.860 | So can you give a brief history of Alexa
00:54:13.060 | from the perspective of just innovation,
00:54:15.460 | the algorithms, the data of how it was born,
00:54:18.660 | how it came to be, how it has grown, where it is today?
00:54:22.260 | - Yeah, it starts with, in Amazon,
00:54:24.340 | everything starts with the customer
00:54:27.020 | and we have a process called working backwards.
00:54:30.340 | Alexa and more specifically than the product Echo,
00:54:35.060 | there was a working backwards document
00:54:36.900 | essentially that reflected what it would be,
00:54:38.900 | started with a very simple vision statement, for instance,
00:54:43.900 | that morphed into a full-fledged document
00:54:47.180 | along the way changed into what all it can do, right?
00:54:51.740 | But the inspiration was the Star Trek computer.
00:54:54.180 | So when you think of it that way,
00:54:56.260 | everything is possible, but when you launch a product,
00:54:58.380 | you have to start with some place.
00:55:01.100 | And when I joined, the product was already in conception
00:55:05.540 | and we started working on the far field speech recognition
00:55:08.940 | because that was the first thing to solve.
00:55:10.980 | By that, we mean that you should be able to speak
00:55:12.860 | to the device from a distance.
00:55:15.260 | And in those days, that wasn't a common practice.
00:55:18.860 | And even in the previous research world I was in
00:55:22.380 | was considered an unsolvable problem then
00:55:24.620 | in terms of whether you can converse from a distance.
00:55:28.340 | And here I'm still talking about the first part
00:55:30.380 | of the problem where you say,
00:55:32.460 | get the attention of the device,
00:55:34.100 | as in by saying what we call the wake word,
00:55:37.140 | which means the word Alexa has to be detected
00:55:40.380 | with a very high accuracy because it is a very common word.
00:55:44.860 | It has sound units that map with words like I like you
00:55:48.260 | or Alec, Alex, right?
00:55:51.140 | So it's an undoubtedly hard problem to detect
00:55:56.140 | the right mentions of Alexa's address to the device
00:56:00.540 | versus I like Alexa.
00:56:02.820 | - So you have to pick up that signal
00:56:04.260 | when there's a lot of noise.
00:56:06.060 | - Not only noise, but a lot of conversation in the house.
00:56:09.460 | Remember on the device,
00:56:10.300 | you're simply listening for the wake word, Alexa.
00:56:13.180 | And there's a lot of words being spoken in the house.
00:56:15.780 | How do you know it's Alexa and directed at Alexa?
00:56:20.780 | Because I could say, I love my Alexa, I hate my Alexa,
00:56:25.300 | I want Alexa to do this.
00:56:26.980 | And in all these three sentences I said Alexa,
00:56:29.260 | I didn't want it to wake up.
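To make the wake-word idea concrete, here is a minimal, hypothetical sketch of keyword spotting: a small model scores a sliding window of acoustic features, and the device only starts streaming to the cloud once that score clears a high threshold. The scoring function, window size, and threshold below are invented placeholders, not Amazon's production detector.

```python
# Illustrative sketch of on-device wake-word spotting: a small model scores a
# sliding window of audio frames, and only when the score clears a high
# threshold does the device start streaming to the cloud. The scoring model
# here is a stand-in; a production detector would be a trained neural network.
import numpy as np

WINDOW_FRAMES = 40     # ~1 second of context, roughly the length of "Alexa"
WAKE_THRESHOLD = 0.85  # kept high: false wakes are worse than a missed one

def score_window(frames: np.ndarray) -> float:
    """Stand-in for a trained keyword classifier.

    `frames` is a (WINDOW_FRAMES, n_features) array of acoustic features
    (e.g. log-mel filterbanks). Returns P(window contains the wake word).
    """
    # Toy scorer: squashes mean frame energy; a real model is a DNN/CNN.
    return float(1.0 / (1.0 + np.exp(-frames.mean())))

def detect_wake_word(feature_stream):
    """Yield frame indices at which the wake word is detected."""
    buffer = []
    for i, frame in enumerate(feature_stream):
        buffer.append(frame)
        if len(buffer) < WINDOW_FRAMES:
            continue
        window = np.stack(buffer[-WINDOW_FRAMES:])
        if score_window(window) >= WAKE_THRESHOLD:
            yield i            # light ring on; start streaming to the cloud
            buffer.clear()     # avoid re-triggering on the same audio

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = (rng.normal(0, 1, size=40) for _ in range(200))  # fake features
    print(list(detect_wake_word(stream)))
```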
00:56:32.100 | - Can I just pause on that second?
00:56:33.780 | What would be your advice that I should probably
00:56:36.740 | in the introduction of this conversation give to people
00:56:39.980 | in terms of them turning off their Alexa device
00:56:43.500 | if they're listening to this podcast conversation out loud?
00:56:48.500 | Like what's the probability
00:56:50.580 | that an Alexa device will go off?
00:56:52.300 | Because we mentioned Alexa like a million times.
00:56:55.180 | - So it will, we have done a lot of different things
00:56:58.140 | where we can figure out that there is the device,
00:57:03.140 | the speech is coming from a human versus over the air.
00:57:08.220 | Also, I mean, in terms of like, also it is,
00:57:10.580 | think about ads or, so we also launched a technology
00:57:14.260 | for watermarking kind of approaches
00:57:16.300 | in terms of filtering it out.
00:57:18.820 | But yes, if this kind of a podcast is happening,
00:57:21.620 | it's possible your device will wake up a few times.
00:57:24.380 | It's an unsolved problem,
00:57:25.460 | but it is definitely something we care very much about.
00:57:30.460 | - But the idea is you want to detect Alexa.
00:57:33.980 | - Meant for the device.
00:57:36.140 | - First of all, just even hearing Alexa
00:57:37.580 | versus I like something, I mean, that's a fascinating part.
00:57:41.100 | So that was the first relief.
00:57:43.100 | - That's the first one.
00:57:43.940 | - Built the world's best detector of Alexa.
00:57:46.020 | - Yeah, the world's best wake word detector
00:57:48.780 | in a far field setting,
00:57:49.980 | not like something where the phone is sitting on the table.
00:57:53.900 | This is like people have devices 40 feet away,
00:57:56.740 | like in my house or 20 feet away,
00:57:58.420 | and you still get an answer.
00:58:00.700 | So that was the first part.
00:58:02.500 | The next is, okay, you're speaking to the device.
00:58:05.900 | Of course, you're going to issue many different requests.
00:58:09.020 | Some may be simple, some may be extremely hard,
00:58:11.580 | but it's a large vocabulary speech recognition problem,
00:58:13.780 | essentially, where the audio is now not coming
00:58:17.660 | onto your phone or a handheld mic like this
00:58:20.380 | or a closed talking mic,
00:58:22.100 | but it's from 20 feet away
00:58:23.900 | where if you're in a busy household,
00:58:26.260 | your son may be listening to music,
00:58:28.860 | your daughter may be running around with something
00:58:31.620 | and asking your mom something and so forth.
00:58:33.820 | So this is like a common household setting
00:58:36.380 | where the words you're speaking to Alexa
00:58:40.180 | need to be recognized with very high accuracy.
00:58:43.380 | Now we are still just in the recognition problem.
00:58:45.820 | We haven't yet come to the understanding one.
00:58:48.140 | - And if I pause, I'm sorry, once again,
00:58:50.140 | what year was this?
00:58:51.180 | Is this before neural networks began to start
00:58:55.500 | to seriously prove themselves in the audio space?
00:59:00.540 | - Yeah, this is around, so I joined in 2013, in April.
00:59:05.540 | So the early research in neural networks coming back
00:59:08.940 | and showing some promising results
00:59:11.380 | in speech recognition space had started happening,
00:59:13.700 | but it was very early.
00:59:15.500 | But we just now build on that
00:59:17.940 | on the very first thing we did when I joined the team.
00:59:22.940 | And remember, it was a very much of a startup environment,
00:59:26.060 | which is great about Amazon.
00:59:28.220 | And we doubled down on deep learning right away.
00:59:31.380 | And we knew we'll have to improve accuracy fast.
00:59:36.380 | And because of that, we worked on,
00:59:39.100 | and the scale of data,
00:59:40.020 | once you have a device like this, if it is successful,
00:59:43.380 | will improve big time.
00:59:45.060 | Like you'll suddenly have large volumes of data
00:59:48.180 | to learn from to make the customer experience better.
00:59:51.220 | So how do you scale deep learning?
00:59:52.620 | So we did one of the first works
00:59:54.700 | in training with distributed GPUs
00:59:57.740 | and where the training time was linear
01:00:01.580 | in terms of like in the amount of data.
01:00:04.100 | So that was quite important work
01:00:06.380 | where it was algorithmic improvements
01:00:08.020 | as well as a lot of engineering improvements
01:00:10.100 | to be able to train on thousands and thousands of hours of speech.
01:00:14.180 | And that was an important factor.
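For a sense of what data-parallel training across GPUs looks like today, here is a minimal sketch using PyTorch's DistributedDataParallel, shown only to illustrate the idea of scaling training near-linearly with workers. The 2013-2014 Alexa work predates these frameworks and used Amazon's own distributed-GPU infrastructure; the toy features, labels, and model below are placeholders.

```python
# Minimal sketch of data-parallel training across multiple GPUs with
# PyTorch's DistributedDataParallel: each worker trains on a shard of the
# data and gradients are averaged across GPUs during backward().
# Launch with:  torchrun --nproc_per_node=<num_gpus> train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for acoustic features -> frame labels.
    features = torch.randn(10_000, 40)
    labels = torch.randint(0, 100, (10_000,))
    dataset = TensorDataset(features, labels)
    sampler = DistributedSampler(dataset)            # each worker sees a shard
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    model = torch.nn.Sequential(
        torch.nn.Linear(40, 512), torch.nn.ReLU(), torch.nn.Linear(512, 100)
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # gradient all-reduce
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optim.zero_grad()
            loss_fn(model(x), y).backward()          # all-reduce happens here
            optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```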
01:00:15.740 | So if you ask me like back in 2013 and 2014,
01:00:19.460 | when we launched Echo,
01:00:22.580 | the combination of large scale data, deep learning progress,
01:00:27.580 | near infinite GPUs we had available on AWS,
01:00:32.100 | even then, all came together for us
01:00:34.940 | to be able to solve the far field speech recognition
01:00:38.540 | to the extent it could be useful to the customers.
01:00:40.780 | It's still not solved.
01:00:41.620 | Like, I mean, it's not that we are perfect
01:00:43.140 | at recognizing speech,
01:00:44.620 | but we are great at it in terms of the settings
01:00:46.900 | that are in homes, right?
01:00:48.460 | So, and that was important even in the early stages.
01:00:51.020 | - So first of all, just even,
01:00:52.060 | I'm trying to look back at that time.
01:00:54.340 | If I remember correctly,
01:00:57.100 | it seems like the task would be pretty daunting.
01:01:01.220 | So like, so we kind of take it for granted
01:01:04.460 | that it works now?
01:01:06.380 | - Yes, you're right.
01:01:07.700 | - So let me like how,
01:01:09.780 | first of all, you mentioned startup.
01:01:10.860 | I wasn't familiar how big the team was.
01:01:12.860 | I kind of, 'cause I know there's a lot of
01:01:14.820 | really smart people working on it.
01:01:16.020 | So now it's a very, very large team.
01:01:17.860 | How big was the team?
01:01:20.820 | How likely were you to fail in the eyes of everyone else?
01:01:24.220 | (laughs)
01:01:25.500 | - And ourselves.
01:01:26.340 | (laughs)
01:01:27.180 | - And yourself?
01:01:28.020 | So like what?
01:01:28.860 | - I'll give you a very interesting anecdote on that.
01:01:31.660 | When I joined the team,
01:01:33.940 | the speech recognition team was six people.
01:01:37.740 | My first meeting, and we had hired a few more people,
01:01:40.580 | it was 10 people.
01:01:42.620 | Nine out of 10 people thought it can't be done.
01:01:45.220 | (laughs)
01:01:47.260 | Right?
01:01:48.100 | - Who was the one?
01:01:48.940 | (laughs)
01:01:49.780 | - The one was me.
01:01:50.620 | - Okay.
01:01:51.460 | - Actually I should say,
01:01:52.620 | and one was semi-optimistic.
01:01:54.780 | - Yeah.
01:01:55.620 | - And eight were trying to convince,
01:01:58.740 | let's go to the management and say,
01:02:01.340 | let's not work on this problem.
01:02:03.220 | Let's work on some other problem,
01:02:04.860 | like either telephony speech for customer service calls
01:02:08.620 | and so forth.
01:02:09.780 | But this was the kind of belief you must have.
01:02:11.820 | And I had experience with far field speech recognition
01:02:14.100 | and my eyes lit up when I saw a problem like that saying,
01:02:17.460 | okay, we have been in speech recognition,
01:02:20.540 | always looking for that killer app.
01:02:22.660 | - Yeah.
01:02:23.500 | - And this was a killer use case
01:02:25.540 | to bring something delightful in the hands of customers.
01:02:28.540 | - You mentioned the way you kind of think of it
01:02:30.860 | in the product way in the future,
01:02:32.380 | have a press release and an FAQ and you think backwards.
01:02:34.980 | - That's right.
01:02:35.820 | - Did you have, did the team have the echo in mind
01:02:39.460 | and so this far field speech recognition
01:02:43.100 | actually putting a thing in the home that works,
01:02:45.420 | that it's able to interact with,
01:02:46.700 | was that the press release?
01:02:48.260 | What was the--
01:02:49.100 | - Very close, I would say in terms of the,
01:02:51.500 | as I said, the vision was Star Trek computer, right?
01:02:54.820 | So, or the inspiration.
01:02:56.940 | And from there, I can't divulge all the exact specifications
01:03:00.660 | but one of the first things that was magical on Alexa
01:03:05.660 | was music.
01:03:08.900 | It brought me to back to music
01:03:11.180 | because my taste is still from when I was an undergrad.
01:03:14.180 | So I still listen to those songs
01:03:15.580 | and it was too hard for me to be a music fan with a phone.
01:03:20.580 | Right, so I hate things in my ear.
01:03:24.140 | So from that perspective, it was quite hard
01:03:28.100 | and music was part of the,
01:03:30.540 | at least the documents I've seen, right?
01:03:33.620 | So from that perspective, I think yes,
01:03:36.100 | in terms of how far are we from the original vision?
01:03:40.580 | I can't reveal that,
01:03:42.020 | but that's why I have a ton of fun at work
01:03:44.500 | because every day we go in and thinking like,
01:03:47.180 | these are the new set of challenges to solve.
01:03:49.020 | - Yeah, it's a great way to do great engineering
01:03:51.860 | as you think of the press release.
01:03:53.620 | I like that idea actually.
01:03:54.980 | Maybe we'll talk about it a bit later
01:03:56.780 | but it's just a super nice way to have a focus.
01:03:59.260 | - I'll tell you this, you're a scientist
01:04:01.340 | and a lot of my scientists have adopted that.
01:04:03.700 | They have now, they love it as a process
01:04:06.980 | because it was very, as scientists,
01:04:08.980 | you're trained to write great papers
01:04:10.940 | but they are all after you've done the research
01:04:13.540 | or you've proven and your PhD dissertation proposal
01:04:16.660 | is something that comes closest
01:04:18.460 | or a DARPA proposal or a NSF proposal
01:04:21.220 | is the closest that comes to a press release.
01:04:23.620 | But that process is now ingrained in our scientists
01:04:27.020 | which is like delightful for me to see.
01:04:29.820 | - You write the paper first and then make it happen.
01:04:33.100 | - That's right.
01:04:33.940 | In fact, it's not--
01:04:34.780 | - State of the art results.
01:04:36.300 | - Or you leave the results section open
01:04:38.460 | where you have a thesis about here's what I expect.
01:04:41.660 | And here's what it will change.
01:04:43.460 | So I think it is a great thing.
01:04:46.500 | It works for researchers as well.
01:04:48.180 | - So far field recognition, what was the big leap?
01:04:53.860 | What were the breakthroughs
01:04:55.460 | and what was that journey like to today?
01:04:58.380 | - Yeah, I think the, as you said first,
01:05:00.180 | there was a lot of skepticism
01:05:01.580 | on whether far field speech recognition
01:05:03.340 | will ever work to be good enough.
01:05:05.460 | And what we first did was got a lot of training data
01:05:09.980 | in a far field setting.
01:05:11.460 | And that was extremely hard to get
01:05:14.020 | because none of it existed.
01:05:16.180 | So how do you collect data in far field setup?
01:05:20.060 | - With no customer base.
01:05:21.260 | - With no customer base.
01:05:22.660 | So that was first innovation.
01:05:24.780 | And once we had that, the next thing was,
01:05:26.980 | okay, if you have the data,
01:05:29.740 | first of all, we didn't talk about like,
01:05:31.860 | what would magical mean in this kind of a setting?
01:05:35.260 | What is good enough for customers, right?
01:05:37.500 | That's always, since you've never done this before,
01:05:40.460 | what would be magical?
01:05:41.620 | So it wasn't just a research problem.
01:05:44.220 | You had to put some, in terms of accuracy
01:05:47.660 | and customer experience features,
01:05:49.900 | some stakes on the ground saying,
01:05:51.500 | here's where I think it should get to.
01:05:54.940 | So you established a bar.
01:05:56.020 | And then how do you measure progress
01:05:57.460 | where it is given you have no customers right now?
01:06:01.660 | So from that perspective, we went,
01:06:04.140 | so first was the data without customers.
01:06:07.500 | Second was doubling down on deep learning
01:06:10.500 | as a way to learn.
01:06:11.860 | And I can just tell you that the combination of the two
01:06:16.100 | cut our error rates by a factor of five.
01:06:19.140 | From where we were when I started to,
01:06:22.220 | within six months of having that data,
01:06:24.260 | we, at that point, I got the conviction
01:06:28.340 | that this will work, right?
01:06:29.860 | So, because that was magical
01:06:31.580 | in terms of when it started working.
01:06:33.460 | - That reached the magical,
01:06:36.180 | it came close to the magical bar.
01:06:37.580 | - That to the bar, right?
01:06:39.460 | That we felt would be where people will use it,
01:06:44.220 | which was critical.
01:06:45.260 | Because you really have one chance at this.
01:06:48.820 | If we had launched in November 2014 is when we launched,
01:06:51.820 | if it was below the bar,
01:06:53.060 | I don't think this category exists
01:06:56.460 | if you don't meet the bar.
01:06:58.020 | - Yeah, and just having looked at voice-based interactions
01:07:01.980 | like in the car, earlier systems,
01:07:05.940 | it's a source of huge frustration for people.
01:07:08.260 | In fact, we use voice-based interaction
01:07:10.260 | for collecting data on subjects to measure frustration.
01:07:14.540 | So as a training set for computer vision, for face data,
01:07:18.180 | so we can get a data set of frustrated people.
01:07:20.580 | That's the best way to get frustrated people
01:07:22.220 | is having them interact with a voice-based system in the car.
01:07:25.500 | So that bar, I imagine, is pretty high.
01:07:28.500 | - It was very high.
01:07:29.420 | And we talked about how also errors are perceived
01:07:32.660 | from AIs versus errors by humans.
01:07:35.340 | But we are not done with the problems that ended up,
01:07:39.820 | we had to solve to get it to launch.
01:07:41.140 | So do you want the next one?
01:07:42.540 | - Yeah, that was the next one.
01:07:45.620 | - So the next one was what I think of as
01:07:49.460 | multi-domain natural language understanding.
01:07:52.420 | It's very, I wouldn't say easy,
01:07:54.660 | but it is during those days,
01:07:57.420 | solving it, understanding in one domain,
01:08:01.260 | a narrow domain was doable,
01:08:03.940 | but for these multiple domains like music, like information,
01:08:10.020 | other kinds of household productivity, alarms, timers,
01:08:14.060 | even though it wasn't as big as it is
01:08:15.740 | in terms of the number of skills Alexa has
01:08:17.380 | and the confusion space has like grown by
01:08:20.380 | three orders of magnitude,
01:08:22.380 | it was still daunting even those days.
01:08:24.220 | - And again, no customer base yet.
01:08:26.300 | - Again, no customer base.
01:08:27.900 | So now you're looking at meaning understanding
01:08:29.860 | and intent understanding and taking actions
01:08:31.860 | on behalf of customers based on their requests.
01:08:35.060 | And that is the next hard problem.
01:08:37.900 | Even if you have gotten the words recognized,
01:08:41.420 | how do you make sense of them?
01:08:44.060 | In those days, there was still a lot of emphasis
01:08:48.900 | on rule-based systems for writing grammar patterns
01:08:52.300 | to understand the intent,
01:08:53.860 | but we had a statistical first approach even then,
01:08:57.140 | where for a language understanding we had,
01:09:00.020 | even those starting days,
01:09:01.300 | an entity recognizer and an intent classifier,
01:09:05.420 | which was all trained statistically.
01:09:08.100 | In fact, we had to build the deterministic matching
01:09:11.340 | as a follow-up to fix bugs
01:09:14.220 | that statistical models have.
01:09:16.180 | So it was just a different mindset
01:09:18.180 | where we focused on data-driven statistical understanding.
01:09:21.980 | - Wins in the end if you have a huge data set.
01:09:24.660 | - Yes, it is contingent on that.
01:09:26.380 | And that's why it came back to how do you get the data.
01:09:29.060 | Before customers, the fact that this is why data
01:09:32.460 | becomes crucial to get to the point that you have
01:09:37.180 | the understanding system built in, built up.
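A toy sketch of the statistical split described here: an intent classifier plus an entity (slot) recognizer, both learned from labeled examples rather than hand-written grammar patterns. The training utterances, intents, artist list, and models below are invented and far smaller than anything in a real assistant.

```python
# Toy sketch of statistical NLU: an intent classifier plus a (trivial) entity
# recognizer, learned from labeled data instead of grammar rules.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few labeled utterances per intent (a real system has millions).
TRAIN = [
    ("play songs by the rolling stones", "PlayMusic"),
    ("play stairway to heaven",          "PlayMusic"),
    ("set a timer for ten minutes",      "SetTimer"),
    ("what is the weather today",        "GetWeather"),
    ("turn off the kitchen light",       "SmartHome"),
]

intent_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
intent_clf.fit([t for t, _ in TRAIN], [i for _, i in TRAIN])

# Trivial stand-in for a trained entity recognizer: match known artist names.
KNOWN_ARTISTS = {"the rolling stones", "led zeppelin", "stone temple pilots"}

def recognize_entities(utterance: str) -> dict:
    found = [a for a in KNOWN_ARTISTS if a in utterance]
    return {"artist": found[0]} if found else {}

def understand(utterance: str) -> dict:
    """Map text to an intent plus slots, the input to action selection."""
    return {"intent": intent_clf.predict([utterance])[0],
            "slots": recognize_entities(utterance)}

print(understand("play something by led zeppelin"))
```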
01:09:40.060 | And notice that for you,
01:09:42.700 | we were talking about human-machine dialogue,
01:09:44.460 | even those early days,
01:09:46.780 | even it was very much transactional,
01:09:52.460 | do one thing, one-shot transactions, in a great way.
01:09:52.460 | There was a lot of debate on how much should Alexa talk back
01:09:54.820 | in terms of if it misunderstood you,
01:09:57.420 | or you said play songs by the Stones,
01:10:01.460 | and let's say it doesn't know, early days,
01:10:04.780 | knowledge can be sparse.
01:10:07.020 | Who are the Stones, right?
01:10:09.300 | The Rolling Stones, right?
01:10:10.780 | So, and you don't want the match to be
01:10:15.460 | Stone Temple Pilots or Rolling Stones, right?
01:10:17.300 | So you don't know which one it is.
01:10:18.940 | So these kind of other signals to,
01:10:22.540 | and out there we had great assets, right?
01:10:24.620 | From Amazon in terms of--
01:10:27.100 | - UX, like what kind of, yeah, how do you solve that problem?
01:10:31.260 | - In terms of what we think of it
01:10:32.340 | as an entity resolution problem, right?
01:10:34.020 | So, because which one is it, right?
01:10:36.220 | I mean, even if you figured out the Stones
01:10:39.060 | as an entity, you have to resolve it
01:10:40.980 | to whether it's the Stones or the Stone Temple Pilots
01:10:43.900 | or some other Stones.
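One hypothetical way to frame that entity-resolution step: score catalog candidates by a mix of string similarity to the recognized mention and a popularity or personal-preference prior, and only answer without a clarifying question when the best score is confident enough. The catalog, priors, and weights below are made up for illustration.

```python
# Hypothetical sketch of entity resolution: given a recognized mention like
# "the stones", score catalog candidates by string similarity combined with a
# popularity (or personal-preference) prior and pick the best one.
from difflib import SequenceMatcher

CATALOG = {
    "The Rolling Stones":  0.90,   # prior: how often customers mean this one
    "Stone Temple Pilots": 0.08,
    "Stone Sour":          0.02,
}

def resolve(mention: str, catalog=CATALOG, min_score=0.5):
    """Return (entity, score), or None if nothing is confident enough,
    in which case the assistant can ask a clarifying question instead."""
    def score(name, prior):
        similarity = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        return 0.7 * similarity + 0.3 * prior      # illustrative weighting
    best = max(((name, score(name, p)) for name, p in catalog.items()),
               key=lambda kv: kv[1])
    return best if best[1] >= min_score else None

print(resolve("the stones"))   # -> ('The Rolling Stones', ...) in this toy setup
```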
01:10:44.900 | - Maybe I misunderstood, is the resolution
01:10:47.140 | the job of the algorithm, or is the job of UX
01:10:50.580 | communicating with the human to help the resolution?
01:10:52.420 | - Well, there is both, right?
01:10:54.300 | It is, you want 90% or high 90s to be done
01:10:58.820 | without any further questioning or UX, right?
01:11:01.260 | So, but it's absolutely okay, just like as humans,
01:11:05.620 | we ask the question, I didn't understand you.
01:11:09.020 | It's fine for Alexa to occasionally say,
01:11:10.660 | I did not understand you, right?
01:11:12.100 | And that's an important way to learn.
01:11:14.660 | And I'll talk about where we have come
01:11:16.260 | with more self-learning with these kinds of feedback signals.
01:11:20.140 | But in those days, just solving the ability
01:11:23.300 | of understanding the intent and resolving to an action,
01:11:26.500 | where action could be play a particular artist
01:11:28.780 | or a particular song was super hard.
01:11:31.980 | Again, the bar was high as we were talking about, right?
01:11:35.460 | So while we launched it in sort of 13 big domains,
01:11:40.340 | I would say in terms of, or we think of it as 13,
01:11:43.420 | the big skills we had, like music is a massive one
01:11:46.740 | when we launched it, and now we have
01:11:48.860 | 90,000 plus skills on Alexa.
01:11:51.580 | - So what are the big skills?
01:11:52.740 | Can you just go over them?
01:11:53.580 | Because the only thing I use it for
01:11:55.580 | is music, weather, and shopping.
01:11:57.740 | - So we think of it as music information, right?
01:12:02.620 | So weather is a part of information, right?
01:12:05.500 | So when we launched, we didn't have smart home,
01:12:08.140 | but within, by smart home I mean,
01:12:10.500 | you connect your smart devices,
01:12:12.180 | you control them with voice.
01:12:13.220 | If you haven't done it, it's worth,
01:12:15.140 | it will change your life.
01:12:15.980 | - Like turning on the lights and so on.
01:12:16.820 | - Yeah, turning on your light, or anything
01:12:18.980 | that's connected, it's just that.
01:12:21.620 | - What's your favorite smart device for you?
01:12:23.260 | - My light.
01:12:24.100 | (laughing)
01:12:24.940 | And now you have the smart plug with,
01:12:26.380 | and you don't, we also have this Echo plug, which is.
01:12:29.980 | - Oh yeah, you can plug in anything.
01:12:30.820 | - You can plug anything and now you can turn
01:12:32.580 | that one on and off, right?
01:12:33.420 | - I'll use this conversation motivation
01:12:35.140 | and get one.
01:12:35.980 | - The garage door, you can check your status
01:12:38.780 | of the garage door and things like,
01:12:40.340 | and we have gone on to make Alexa more and more proactive,
01:12:43.260 | where it even has hunches now,
01:12:45.660 | hunches like, you left your light on.
01:12:49.220 | Let's say you've gone to your bed
01:12:51.740 | and you left the garage light on.
01:12:52.940 | So yeah, it will help you out in these settings, right?
01:12:57.100 | - That's smart devices.
01:12:58.420 | - Information, smart devices, you said music.
01:13:01.180 | - Yeah, so I don't remember everything we had.
01:13:02.980 | - Yeah, but those are the big ones.
01:13:03.820 | - Timers were the big ones.
01:13:05.060 | Like that was, you know, the timers were very popular
01:13:08.340 | right away.
01:13:09.540 | Music also, like you could play song, artist, album,
01:13:13.500 | everything.
01:13:14.940 | So that was like a clear win in terms
01:13:17.500 | of the customer experience.
01:13:19.460 | So that's, again, this is language understanding.
01:13:22.780 | Now things have evolved, right?
01:13:24.140 | So where we want Alexa definitely to be more accurate,
01:13:28.420 | competent, trustworthy based on how well it does
01:13:31.580 | these core things.
01:13:33.140 | But we have evolved in many different dimensions.
01:13:35.300 | First is what I think of it doing more conversational
01:13:38.420 | for high utility, not just for chat, right?
01:13:40.980 | And there at re:MARS this year, which is our AI conference,
01:13:44.940 | we launched what is called Alexa Conversations.
01:13:48.580 | That is providing the ability for developers
01:13:51.820 | to author multi-turn experiences on Alexa
01:13:55.060 | with no code essentially,
01:13:57.100 | in terms of the dialogue code.
01:13:58.900 | Initially it was like, you know, all these IVR systems,
01:14:02.620 | you have to fully author if the customer says this,
01:14:06.580 | do that, right?
01:14:07.580 | So the whole dialogue flow is hand authored.
01:14:11.460 | And with Alexa Conversations, the way it is
01:14:14.380 | that you just provide a sample interaction data
01:14:16.780 | with your service or an API, let's say you're Atom Tickets,
01:14:19.140 | that provides a service for buying movie tickets.
01:14:23.420 | You provide a few examples of how your customers
01:14:25.860 | will interact with your APIs.
01:14:27.820 | And then the dialogue flow is automatically constructed
01:14:29.980 | using a recurrent neural network, trained on that data.
01:14:33.380 | So that simplifies the developer experience.
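To illustrate the general idea of learning a dialogue flow from sample interactions (a sketch, not the Alexa Conversations implementation), a small recurrent network can read the turns seen so far and predict the next system action, such as which slot to ask for or which API to call. The turn vocabulary, action set, and sample dialogue below are invented.

```python
# Sketch of learning a dialogue policy from sample interactions: a recurrent
# network reads the dialogue turns so far and predicts the next system action.
import torch
import torch.nn as nn

TURNS = ["USER_find_movies", "SYS_ask_theater", "USER_give_theater",
         "SYS_ask_showtime", "USER_give_showtime", "SYS_call_BuyTickets"]
ACTIONS = ["SYS_ask_theater", "SYS_ask_showtime", "SYS_call_BuyTickets"]
turn_ix = {t: i for i, t in enumerate(TURNS)}
act_ix = {a: i for i, a in enumerate(ACTIONS)}

class DialoguePolicy(nn.Module):
    def __init__(self, n_turns, n_actions, dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_turns, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_actions)

    def forward(self, turn_ids):                 # (batch, seq_len)
        _, h = self.rnn(self.embed(turn_ids))    # final hidden state
        return self.out(h[-1])                   # logits over next actions

# One sample dialogue from the developer; train next-action prediction on its prefixes.
sample = TURNS
examples = [(sample[:i], sample[i]) for i in range(1, len(sample))
            if sample[i].startswith("SYS_")]

model = DialoguePolicy(len(TURNS), len(ACTIONS))
optim = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    for history, action in examples:
        x = torch.tensor([[turn_ix[t] for t in history]])
        y = torch.tensor([act_ix[action]])
        optim.zero_grad()
        loss_fn(model(x), y).backward()
        optim.step()

history = ["USER_find_movies", "SYS_ask_theater", "USER_give_theater"]
pred = model(torch.tensor([[turn_ix[t] for t in history]])).argmax(-1).item()
print("next action:", ACTIONS[pred])   # ideally SYS_ask_showtime after training
```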
01:14:35.940 | We just launched our preview for the developers
01:14:38.460 | to try this capability out.
01:14:40.620 | And then the second part of it,
01:14:42.140 | which shows even increased utility for customers,
01:14:45.740 | is you and I, when we interact with Alexa
01:14:48.940 | or any customer, as I'm coming back
01:14:51.780 | to our initial part of the conversation,
01:14:53.180 | the goal is often unclear or unknown to the AI.
01:14:58.180 | If I say, Alexa, what movies are playing nearby?
01:15:02.700 | Am I trying to just buy movie tickets?
01:15:08.020 | Am I actually even, do you think I'm looking
01:15:11.380 | for just movies for curiosity,
01:15:12.860 | whether the Avengers is still in theater or when is it?
01:15:15.900 | Maybe it's gone, or maybe I missed it.
01:15:18.460 | So I may watch it on Prime, which happened to me.
01:15:22.660 | So from that perspective now,
01:15:25.460 | you're looking into what is my goal?
01:15:28.460 | And let's say I now complete the movie ticket purchase.
01:15:32.260 | Maybe I would like to get dinner nearby.
01:15:35.860 | So what is really the goal here?
01:15:40.420 | Is it night out or is it movies?
01:15:43.820 | As in just go watch a movie?
01:15:45.780 | The answer is, we don't know.
01:15:47.980 | So can Alexa now figure we have the intelligence
01:15:52.540 | that I think this meta goal is really night out
01:15:55.460 | or at least say to the customer,
01:15:57.580 | when you've completed the purchase of movie tickets
01:16:00.020 | from Atom Tickets or Fandango or Piccu or anyone,
01:16:03.260 | then the next thing is, do you want to get an Uber
01:16:06.260 | to the theater, right?
01:16:10.820 | Or do you want to book a restaurant next to it?
01:16:14.420 | And then not ask the same information over and over again,
01:16:18.980 | what time, how many people in your party, right?
01:16:23.980 | So this is where you shift the cognitive burden
01:16:28.060 | from the customer to the AI,
01:16:30.380 | where it's thinking of what is your,
01:16:33.540 | it anticipates your goal
01:16:35.540 | and takes the next best action to complete it.
01:16:38.820 | Now that's the machine learning problem.
01:16:42.140 | But essentially the way we solve this first instance
01:16:45.180 | and we have a long way to go to make it scale
01:16:48.220 | to everything possible in the world,
01:16:50.100 | but at least for this situation,
01:16:51.500 | it is from at every instance,
01:16:54.380 | Alexa is making the determination
01:16:55.980 | whether it should stick with the experience
01:16:57.620 | with Atom Tickets or offer,
01:17:00.260 | based on what you say,
01:17:03.780 | whether either you have completed the interaction
01:17:06.260 | or you said, no, get me an Uber now.
01:17:07.740 | So it will shift context into another experience or skill.
01:17:12.020 | Or another service.
01:17:12.860 | So that's a dynamic decision-making.
01:17:15.340 | That's making Alexa, you can say more conversational
01:17:18.140 | for the benefit of the customer,
01:17:20.180 | rather than simply complete transactions,
01:17:22.500 | which are well thought through.
01:17:25.220 | You as a customer has fully specified
01:17:27.780 | what you want to be accomplished.
01:17:29.660 | It's accomplishing that.
01:17:30.780 | - So it's kind of as,
01:17:32.420 | we do this with pedestrians, right?
01:17:34.260 | Intent modeling is predicting what your possible goals are
01:17:38.660 | and what's the most likely goal
01:17:39.980 | and switching that depending on the things you say.
01:17:42.380 | So my question is there,
01:17:44.380 | it seems maybe it's a dumb question,
01:17:46.500 | but it would help a lot if Alexa remembered me,
01:17:51.380 | what I said previously.
01:17:53.020 | - Right.
01:17:53.860 | - Is it trying to use some memory for the customer?
01:17:58.380 | - Yeah, it is using a lot of memory within that.
01:18:00.820 | So right now, not so much in terms of,
01:18:02.660 | okay, which restaurant do you prefer?
01:18:05.100 | Right, that is a more long-term memory,
01:18:06.820 | but within the short-term memory, within the session,
01:18:09.860 | it is remembering how many people did you,
01:18:11.820 | so if you said buy four tickets,
01:18:13.820 | now it has made an implicit assumption
01:18:15.660 | that you are going to have,
01:18:18.300 | you need at least four seats at a restaurant, right?
01:18:21.740 | So these are the kind of contexts it's preserving
01:18:24.300 | between these skills, but within that session.
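A hypothetical sketch of the short-term session memory being described: slots established in one skill (the party size from a ticket purchase) are carried into the next skill in the same session so the assistant doesn't re-ask. The skill functions and slot names are illustrative only.

```python
# Sketch of short-term session memory: slots set by one skill are reused by
# the next skill in the same session, and cleared when the session ends.
from dataclasses import dataclass, field

@dataclass
class SessionContext:
    slots: dict = field(default_factory=dict)

    def update(self, **new_slots):
        self.slots.update(new_slots)

    def get(self, name, default=None):
        return self.slots.get(name, default)

def buy_movie_tickets(ctx: SessionContext, movie: str, tickets: int):
    ctx.update(party_size=tickets, activity="movie", movie=movie)
    return f"Booked {tickets} tickets for {movie}."

def book_restaurant(ctx: SessionContext, name: str):
    party = ctx.get("party_size")
    if party is None:
        return "How many people should I book for?"   # only ask if unknown
    return f"Reserved a table for {party} at {name}."

ctx = SessionContext()
print(buy_movie_tickets(ctx, "Avengers", tickets=4))
print(book_restaurant(ctx, "the Italian place nearby"))  # reuses party_size=4
```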
01:18:26.820 | But you're asking the right question in terms of
01:18:29.860 | for it to be more and more useful,
01:18:32.180 | it has to have more long-term memory.
01:18:33.780 | And that's also an open question.
01:18:35.220 | And again, these are still early days.
01:18:37.260 | - So for me, I mean, everybody's different,
01:18:40.380 | but yeah, I'm definitely not representative
01:18:44.020 | of the general population in the sense
01:18:45.340 | that I do the same thing every day.
01:18:47.060 | Like I eat the same, I do everything the same,
01:18:50.340 | the same thing, wear the same thing, clearly,
01:18:53.540 | this or the black shirt.
01:18:55.540 | So it's frustrating when Alexa doesn't get what I'm saying
01:18:59.180 | because I have to correct her every time
01:19:02.100 | in the exact same way.
01:19:02.980 | This has to do with certain songs.
01:19:05.660 | Like she doesn't know certain weird songs.
01:19:08.420 | And doesn't know, I've complained to Spotify about this,
01:19:11.420 | talked to the head of R&D at Spotify,
01:19:14.060 | Stairway to Heaven, I have to correct it every time.
01:19:16.460 | - Really?
01:19:17.300 | - It doesn't play Led Zeppelin correctly.
01:19:18.780 | It plays a cover of Stairway to Heaven.
01:19:22.540 | - You should figure, you should send me your,
01:19:24.940 | next time it fails, feel free to send it to me,
01:19:27.500 | we'll take care of it.
01:19:28.460 | - Okay, well.
01:19:29.300 | - Because Led Zeppelin is one of my favorite bands
01:19:31.700 | and it works for me, so I'm like shocked
01:19:33.380 | it doesn't work for you.
01:19:34.220 | - This is an official bug report.
01:19:35.500 | I'll make it public, make everybody retweet it.
01:19:39.060 | We're gonna fix the Stairway to Heaven problem.
01:19:41.020 | Anyway, but the point is, I'm pretty boring
01:19:44.340 | and do the same things, but I'm sure most people
01:19:46.140 | do the same set of things.
01:19:48.380 | Do you see Alexa sort of utilizing that in the future
01:19:51.420 | for improving the experience?
01:19:52.820 | - Yes, and not only utilizing,
01:19:54.700 | it's already doing some of it.
01:19:56.220 | We call it, where Alexa is becoming more self-learning.
01:19:59.580 | So Alexa is now auto-correcting millions and millions
01:20:04.420 | of utterances in the US without any human supervision
01:20:07.940 | involved.
01:20:08.780 | The way it does it is, let's take an example
01:20:11.980 | of a particular song didn't work for you.
01:20:14.780 | What do you do next?
01:20:15.740 | You either, it played the wrong song and you said,
01:20:18.540 | Alexa, no, that's not the song I want.
01:20:20.780 | Or you say, Alexa, play that, you try it again.
01:20:25.220 | And that is a signal to Alexa that she may have done
01:20:29.020 | something wrong.
01:20:30.140 | And from that perspective, we can learn
01:20:33.540 | if there's that failure pattern or that action
01:20:36.740 | of song A was played when song B was requested.
01:20:41.100 | And it's very common with station names
01:20:43.100 | because play NPR, you can have N be confused as an M
01:20:47.220 | and then you, for a certain accent like mine,
01:20:50.980 | people confuse my N and M all the time.
01:20:54.780 | And because I have an Indian accent,
01:20:57.700 | they're confusable to humans.
01:20:59.660 | It is for Alexa too.
01:21:01.660 | And in that part, but it starts auto-correcting
01:21:05.140 | and we correct a lot of these automatically
01:21:09.740 | without a human looking at the failures.
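One simple way to picture this kind of self-learning from implicit feedback (a sketch, not the production system): when a request is barged in on and quickly rephrased, treat the failed hypothesis and the successful follow-up as a candidate rewrite pair, and apply the rewrite automatically once it has been observed often enough. The log format, thresholds, and the NPR/MPR example below are illustrative.

```python
# Illustrative sketch of self-learning from implicit feedback: mine
# (failed hypothesis -> successful retry) pairs from barge-ins and rephrases,
# and apply a rewrite once it has enough support, with no human labeling.
from collections import Counter

MIN_SUPPORT = 3            # how many times a rewrite must be observed
RETRY_WINDOW_SECONDS = 30  # a retry soon after a failure counts as feedback

def mine_rewrites(interaction_log):
    """interaction_log: list of (timestamp, recognized_text, user_barged_in)."""
    rewrites = Counter()
    for (t1, text1, barged), (t2, text2, _) in zip(interaction_log,
                                                   interaction_log[1:]):
        if barged and (t2 - t1) <= RETRY_WINDOW_SECONDS and text1 != text2:
            rewrites[(text1, text2)] += 1
    return {src: dst for (src, dst), n in rewrites.items() if n >= MIN_SUPPORT}

def apply_rewrites(recognized_text, rewrites):
    return rewrites.get(recognized_text, recognized_text)

# N/M confusion on "play NPR", seen across several customers' sessions.
log = [
    (0,   "play mpr", True),  (10,  "play npr", False),
    (100, "play mpr", True),  (110, "play npr", False),
    (200, "play mpr", True),  (210, "play npr", False),
]
rewrites = mine_rewrites(log)
print(apply_rewrites("play mpr", rewrites))   # -> "play npr"
```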
01:21:12.740 | - So one of the things that's for me missing in Alexa,
01:21:17.420 | I don't know if I'm a representative customer,
01:21:19.780 | but every time I correct it, it would be nice to know
01:21:24.780 | that that made a difference.
01:21:26.220 | - Yes.
01:21:27.060 | - You know what I mean?
01:21:27.900 | Like the sort of like, I heard you,
01:21:30.380 | like a sort of-
01:21:31.940 | - Some acknowledgement of that.
01:21:33.860 | - We work a lot with Tesla, we study autopilot and so on.
01:21:37.500 | And a large amount of the customers
01:21:39.260 | that use Tesla autopilot,
01:21:40.740 | they feel like they're always teaching the system.
01:21:43.020 | They're almost excited by the possibility
01:21:44.460 | that they're teaching.
01:21:45.300 | I don't know if Alexa customers generally think of it
01:21:48.460 | as they're teaching to improve the system.
01:21:51.180 | And that's a really powerful thing.
01:21:52.700 | - Again, I would say it's a spectrum.
01:21:55.220 | Some customers do think that way
01:21:57.340 | and some would be annoyed by Alexa acknowledging that.
01:22:01.300 | So there's again, no one,
01:22:04.340 | while there are certain patterns,
01:22:05.740 | not everyone is the same in this way.
01:22:08.140 | But we believe that again, customers helping Alexa
01:22:13.140 | is a tenet for us in terms of improving it.
01:22:15.700 | And more self-learning is by, again,
01:22:18.260 | this is like fully unsupervised, right?
01:22:20.100 | There is no human in the loop and no labeling happening.
01:22:23.580 | And based on your actions as a customer,
01:22:27.100 | Alexa becomes smarter.
01:22:29.020 | Again, it's early days,
01:22:31.100 | but I think this whole area of teachable AI
01:22:35.780 | is gonna get bigger and bigger in the whole space,
01:22:38.620 | especially in the AI assistant space.
01:22:40.700 | So that's the second part
01:22:41.860 | where I mentioned more conversational,
01:22:44.780 | this is more self-learning.
01:22:46.460 | The third is more natural.
01:22:48.260 | And the way I think of more natural
01:22:50.220 | is we talked about how Alexa sounds.
01:22:53.220 | And we have done a lot of advances in our text to speech
01:22:58.020 | by using, again, neural network technology
01:23:00.420 | for it to sound very human-like.
01:23:03.460 | - From the individual texture of the sound
01:23:05.580 | to the timing, the tonality, the tone, everything.
01:23:08.580 | - Everything.
01:23:09.420 | I would think in terms of,
01:23:10.980 | there's a lot of controls in each of the places
01:23:13.340 | for how, I mean, the speed of the voice,
01:23:16.620 | the prosodic patterns,
01:23:19.500 | the actual smoothness of how it sounds,
01:23:23.340 | all of those are factored.
01:23:24.380 | And we do a ton of listening tests to make sure.
01:23:27.100 | But naturalness, how it sounds should be very natural.
01:23:30.740 | How it understands requests is also very important.
01:23:33.660 | Like, and in terms of, like, we have 95,000 skills,
01:23:37.140 | and if we have, imagine that in many of these skills,
01:23:41.460 | you have to remember the skill name.
01:23:43.340 | And say, Alexa, ask the Tide skill to tell me X.
01:23:49.300 | Or, now, if you have to remember the skill name,
01:23:52.660 | that means the discovery and the interaction is unnatural.
01:23:56.340 | And we are trying to solve that by what we think of as,
01:24:00.620 | again, this was, you don't have to have the app metaphor here.
01:24:05.420 | These are not individual apps, right?
01:24:07.140 | Even though they're, so you're not sort of opening
01:24:09.740 | one at a time and interacting.
01:24:11.100 | So it should be seamless because it's voice.
01:24:13.700 | And when it's voice, you have to be able
01:24:15.780 | to understand these requests,
01:24:17.260 | independent of the specificity, like a skill name.
01:24:20.300 | And to do that, what we have done is, again,
01:24:22.540 | built a deep learning-based capability
01:24:24.140 | where we shortlist a bunch of skills
01:24:26.740 | when you say, Alexa, get me a car.
01:24:28.580 | And then we figure it out, okay,
01:24:29.780 | it's meant for an Uber skill versus a Lyft
01:24:33.020 | or based on your preferences.
01:24:34.580 | And then you can rank the responses from the skill
01:24:38.020 | and then choose the best response for the customer.
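A hypothetical two-stage sketch of that name-free skill invocation: shortlist candidate skills by similarity between the utterance and each skill's description, then rank the candidate responses, here with a made-up confidence plus a personal-preference prior. Skill names, descriptions, and weights are invented.

```python
# Two-stage sketch: (1) shortlist skills by utterance/description similarity,
# (2) rank the shortlisted skills' responses and pick the best one.
import math
from collections import Counter

SKILLS = {
    "RideShareA": "get me a car ride taxi to a place",
    "RideShareB": "request a car driver pickup ride",
    "PizzaShop":  "order a pizza for delivery",
}
PREFERENCE = {"RideShareA": 0.7, "RideShareB": 0.3, "PizzaShop": 0.5}

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def shortlist(utterance: str, k: int = 2):
    q = Counter(utterance.lower().split())
    scored = [(cosine(q, Counter(desc.split())), name)
              for name, desc in SKILLS.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

def rank_responses(utterance: str, candidates):
    # Stand-in for invoking each skill and scoring its response.
    responses = {name: (f"{name} can handle: '{utterance}'", 0.8)
                 for name in candidates}
    best = max(responses,
               key=lambda n: 0.6 * responses[n][1] + 0.4 * PREFERENCE[n])
    return responses[best][0]

utterance = "alexa get me a car"
print(rank_responses(utterance, shortlist(utterance)))
```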
01:24:40.980 | So that's on the more natural.
01:24:42.940 | Other examples of more natural is like,
01:24:46.060 | we were talking about lists, for instance,
01:24:48.780 | and you don't want to say, Alexa, add milk.
01:24:51.380 | Alexa, add eggs.
01:24:53.340 | Alexa, add cookies.
01:24:54.820 | No, Alexa, add cookies, milk, and eggs,
01:24:56.940 | and that in one shot, right?
01:24:58.900 | So that works, that helps with the naturalness.
01:25:01.420 | We talked about memory.
01:25:02.700 | Like if you said, you can say, Alexa, remember,
01:25:07.460 | I have to go to mom's house,
01:25:08.700 | or you may have entered a calendar event
01:25:10.860 | through your calendar that's linked to Alexa.
01:25:13.180 | You don't want to have to remember whether it's in my calendar
01:25:15.460 | or did I tell you to remember something
01:25:18.020 | or some other reminder, right?
01:25:20.620 | So you have to now,
01:25:23.060 | independent of how customers create these events,
01:25:27.100 | it should just say, Alexa,
01:25:28.020 | when do I have to go to mom's house?
01:25:29.460 | And it tells you when you have to go to mom's house.
01:25:31.940 | - That's a fascinating problem.
01:25:33.300 | Who's that problem on?
01:25:34.860 | So there's people who create skills.
01:25:37.100 | Who's tasked with integrating all of that knowledge together
01:25:42.420 | so the skills become seamless?
01:25:44.220 | Is it the creators of the skills?
01:25:46.100 | Or is it an infrastructure that Alexa provides problem?
01:25:50.820 | - It's both.
01:25:51.660 | I think the large problem in terms of making sure
01:25:54.500 | your skill quality is high,
01:25:56.260 | that has to be done by our tools
01:26:00.780 | because it's just, so these skills,
01:26:02.660 | just to put the context,
01:26:04.260 | they're built through Alexa Skills Kit,
01:26:05.860 | which is a self-serve way of building
01:26:08.700 | an experience on Alexa.
01:26:09.980 | This is like any developer in the world
01:26:12.500 | could go to Alexa Skills Kit
01:26:14.380 | and build an experience on Alexa.
01:26:16.380 | Like if you're a Domino's,
01:26:17.780 | you can build a Domino's skills,
01:26:19.700 | for instance, that does pizza ordering.
01:26:22.100 | When you've authored that,
01:26:23.980 | you do want to now,
01:26:27.860 | if people say, Alexa, open Domino's,
01:26:29.660 | or Alexa, ask Domino's to get a particular type of pizza,
01:26:34.660 | that will work, but the discovery is hard.
01:26:37.340 | You can't just say, Alexa, get me a pizza,
01:26:38.860 | and then Alexa figures out what to do.
01:26:41.980 | That latter part is definitely our responsibility
01:26:44.540 | in terms of when the request is not fully specific,
01:26:48.500 | how do you figure out what's the best skill
01:26:51.060 | or a service that can fulfill the customer's request?
01:26:55.620 | And it can keep evolving.
01:26:56.820 | Imagine going to the situation I said,
01:26:58.820 | which was the night out planning,
01:26:59.900 | that the goal could be more than that individual request
01:27:03.020 | that came up.
01:27:05.140 | A pizza ordering could mean a night in,
01:27:08.140 | where you're having an event with your kids
01:27:10.060 | in their house, and you're,
01:27:11.860 | so this is, welcome to the world of conversational AI.
01:27:14.780 | (laughs)
01:27:16.300 | - This is super exciting,
01:27:17.740 | because it's not the academic problem of NLP,
01:27:20.340 | of natural language processing, understanding, dialogue.
01:27:22.700 | This is like real world.
01:27:24.260 | And the stakes are high in the sense
01:27:26.740 | that customers get frustrated quickly,
01:27:29.620 | people get frustrated quickly,
01:27:31.420 | so you have to get it right,
01:27:32.740 | you have to get that interaction right.
01:27:34.900 | So it's, I love it.
01:27:36.500 | But, so from that perspective,
01:27:38.820 | what are the challenges today?
01:27:41.540 | What are the problems that really need to be solved
01:27:44.620 | in the next few years?
01:27:45.460 | - Yeah, I think first and foremost,
01:27:47.540 | as I mentioned that,
01:27:48.900 | get the basics right are still true.
01:27:52.900 | Basically, even the one short request,
01:27:56.620 | which we think of as transactional request
01:27:58.460 | needs to work magically, no question about that.
01:28:01.300 | If it doesn't turn your light on and off,
01:28:03.180 | you'll be super frustrated.
01:28:04.820 | Even if I can complete the night out for you
01:28:06.660 | and not do that, that is unacceptable as a customer, right?
01:28:10.300 | So that, you have to get the foundational understanding
01:28:13.700 | going very well.
01:28:15.020 | The second aspect, when I said more conversational,
01:28:17.340 | is as you imagine, is more about reasoning.
01:28:19.740 | It is really about figuring out what the latent goal is
01:28:23.940 | of the customer based on what I have the information now,
01:28:28.100 | and the history, and what's the next best thing to do.
01:28:30.940 | So that's a complete reasoning and decision making problem.
01:28:35.020 | Just like your self-driving car,
01:28:36.660 | but the goal is still more finite.
01:28:38.300 | Here, it evolves.
01:28:39.620 | Your environment is super hard and self-driving,
01:28:42.380 | and the cost of a mistake is huge.
01:28:45.860 | Here, but there are certain similarities.
01:28:48.140 | But if you think about how many decisions Alexa is making
01:28:52.260 | or evaluating at any given time,
01:28:53.860 | it's a huge hypothesis space.
01:28:56.100 | And we've only talked so far
01:28:59.380 | about what I think of as reactive decisions,
01:29:01.660 | in terms of you asked for something
01:29:03.260 | and Alexa is reacting to it.
01:29:05.540 | If you bring the proactive part,
01:29:07.380 | which is Alexa having hunches.
01:29:09.660 | So any given instance, then it's really a decision
01:29:14.060 | at any given point based on the information.
01:29:16.820 | Alexa has to determine what's the best thing it needs to do.
01:29:19.740 | So these are the ultimate AI problem
01:29:22.140 | about decisions based on the information you have.
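One way to think about the proactive part in expected-utility terms (an illustrative sketch, not how Alexa actually scores hunches): only interrupt when the estimated probability that something is wrong, weighed against the cost of a false alarm, makes speaking up worthwhile. The utilities below are invented.

```python
# Toy expected-utility rule for a proactive "hunch": speak up only when the
# estimated chance the customer made a mistake outweighs the annoyance cost
# of a false alarm. All numbers are invented for illustration.
def should_offer_hunch(p_mistake: float,
                       benefit_if_right: float = 1.0,
                       cost_if_wrong: float = 0.4) -> bool:
    expected_utility = p_mistake * benefit_if_right - (1 - p_mistake) * cost_if_wrong
    return expected_utility > 0

# e.g. the garage light is on late at night, unlike the customer's usual pattern
print(should_offer_hunch(p_mistake=0.8))   # True: "You left the garage light on."
print(should_offer_hunch(p_mistake=0.2))   # False: stay quiet
```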
01:29:24.700 | - Do you think, just from my perspective,
01:29:27.100 | I work a lot with sensing of the human face.
01:29:30.700 | Do you think, and we touched this topic a little bit earlier
01:29:34.020 | but do you think there'll be a day soon
01:29:36.140 | when Alexa can also look at you
01:29:38.500 | to help improve the quality of the hunch it has,
01:29:42.780 | or at least detect frustration or detect,
01:29:47.580 | improve the quality of its perception
01:29:51.180 | of what you're trying to do?
01:29:53.940 | - I mean, let me again bring it back to what it already does.
01:29:56.740 | We talked about how, when you barge in over Alexa,
01:30:01.420 | clearly there's a very high probability
01:30:04.580 | it must have done something wrong.
01:30:06.180 | That's why you barged in.
01:30:08.140 | The next extension, whether frustration is a signal or not,
01:30:12.860 | of course, is a natural thought in terms of
01:30:15.620 | how that should be a signal to it.
01:30:17.820 | - You can get that from voice.
01:30:19.140 | - You can get from voice, but it's very hard.
01:30:20.900 | Like, I mean, frustration as a signal, historically,
01:30:25.540 | if you think about emotions of different kinds,
01:30:28.060 | you know, there's a whole field of affective computing,
01:30:31.020 | something that MIT has also done a lot of research in,
01:30:34.100 | is super hard.
01:30:35.180 | And you're now talking about a far field device,
01:30:38.620 | as in you're talking to a distance, noisy environment.
01:30:41.500 | And in that environment,
01:30:43.660 | it needs to have a good sense for your emotions.
01:30:47.100 | This is a very, very hard problem.
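One way to picture the barge-in signal mentioned above, as a hypothetical labeling step rather than Alexa's actual pipeline: treat a quick interruption of the response as a weak negative label on whatever interpretation the system chose, which a self-learning loop can then train against. Field names are assumptions.

```python
from typing import Optional

def implicit_label_from_barge_in(turn: dict) -> Optional[int]:
    """Weak implicit feedback for one dialogue turn; all field names are illustrative."""
    barged_in = turn.get("barge_in", False)
    played_ms = turn.get("response_played_ms", 0)
    if barged_in and played_ms < 2000:
        return 0      # interrupted almost immediately: probably a misunderstanding
    if not barged_in:
        return 1      # response played through: weak positive signal
    return None       # late interruption: too ambiguous to use as a label

print(implicit_label_from_barge_in({"barge_in": True, "response_played_ms": 800}))  # 0
```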
01:30:49.020 | - Very hard problem,
01:30:49.860 | but you haven't shied away from hard problems.
01:30:51.460 | (laughing)
01:30:53.060 | So deep learning has been at the core
01:30:54.820 | of a lot of this technology.
01:30:56.980 | Are you optimistic about the current
01:30:58.260 | deep learning approaches to solving the hardest aspects
01:31:01.060 | of what we're talking about?
01:31:02.820 | Or do you think there will come a time
01:31:04.940 | where new ideas need to,
01:31:06.820 | for the, you know, if we look at reasoning,
01:31:08.940 | so open AI, deep mind,
01:31:10.300 | a lot of folks are now starting to work in reasoning,
01:31:13.460 | trying to see how we can make neural networks reason.
01:31:16.180 | Do you see that new approaches need to be invented
01:31:20.100 | to take the next big leap?
01:31:22.940 | - Absolutely, I think there has to be a lot more investment
01:31:26.820 | and I think in many different ways.
01:31:29.020 | And there are these, I would say,
01:31:30.820 | nuggets of research forming in a good way,
01:31:33.180 | like learning with less data
01:31:35.740 | or like zero-shot learning, one-shot learning.
01:31:39.300 | - And the active learning stuff you've talked about
01:31:41.020 | is incredible stuff.
01:31:42.780 | - So transfer learning is also super critical,
01:31:45.300 | especially when you're thinking about applying knowledge
01:31:48.220 | from one task to another or one language to another, right?
01:31:51.660 | That's really ripe.
01:31:52.940 | So these are great pieces.
01:31:55.260 | Deep learning has been useful too.
01:31:56.740 | And now we are sort of marrying deep learning
01:31:58.820 | with transfer learning and active learning.
01:32:02.420 | Of course, that's more straightforward
01:32:04.420 | in terms of applying deep learning
01:32:05.820 | in an active learning setup.
01:32:06.900 | But I do think that now looking
01:32:11.900 | into more reasoning-based approaches
01:32:14.180 | is going to be key for our next wave of the technology.
01:32:19.180 | But there is good news.
01:32:20.780 | The good news is that I think,
01:32:22.180 | to keep on delighting customers,
01:32:24.340 | that a lot of it can be done by prediction tasks.
01:32:27.180 | And so we haven't exhausted that.
01:32:30.620 | So we don't need to give up
01:32:34.380 | on the deep learning approaches for that.
01:32:37.220 | So that's just, I wanted to sort of point that out.
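A hedged sketch of the "deep learning plus transfer learning plus active learning" combination described above, using placeholder dimensions and a stand-in encoder (none of this reflects Alexa's actual models): reuse a pretrained utterance encoder, train only a small intent head on the few labels available, and route the most uncertain utterances to annotators.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Transfer learning: freeze a pretrained encoder, learn only a small intent head."""
    def __init__(self, pretrained_encoder: nn.Module, emb_dim: int, num_intents: int):
        super().__init__()
        self.encoder = pretrained_encoder            # knowledge transferred from a source task
        for p in self.encoder.parameters():          # keep transferred weights fixed
            p.requires_grad = False
        self.head = nn.Linear(emb_dim, num_intents)  # the only part trained on the new task

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(features))

# Stand-in "pretrained" encoder and fake utterance features, just to make the sketch run.
encoder = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
model = IntentClassifier(encoder, emb_dim=64, num_intents=5)
logits = model(torch.randn(8, 64))                   # batch of 8 utterances

# Active-learning flavor: send the most uncertain utterances to human annotation.
probs = logits.softmax(dim=-1)
entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
print(entropy.topk(k=2).indices)                     # indices most worth labeling next
```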
01:32:39.380 | - So creating a rich, fulfilling, amazing experience
01:32:42.500 | that makes Amazon a lot of money
01:32:44.140 | and makes everybody a lot of money
01:32:46.300 | because it does awesome things, deep learning is enough.
01:32:49.780 | The point--
01:32:50.980 | - I don't think, no, I wouldn't say deep learning is enough.
01:32:54.060 | I think for the purposes of Alexa
01:32:56.580 | and accomplishing tasks for customers,
01:32:58.340 | I'm saying there are still a lot of things we can do
01:33:02.100 | with prediction-based approaches that do not reason.
01:33:05.060 | Right, I'm not saying that, and we haven't exhausted those.
01:33:08.500 | But for the kind of high utility experiences
01:33:12.340 | that I'm personally passionate about
01:33:14.140 | of what Alexa needs to do, reasoning has to be solved
01:33:18.700 | to the same extent that
01:33:20.940 | natural language understanding
01:33:23.500 | and speech recognition have been
01:33:25.420 | for understanding intents,
01:33:28.940 | in terms of how accurate they have become.
01:33:30.060 | But with reasoning, we are in very, very early days.
01:33:32.700 | - Let me ask that another way.
01:33:33.940 | How hard of a problem do you think that is?
01:33:36.700 | - Hardest of them.
01:33:37.740 | (laughing)
01:33:39.100 | I would say hardest of them because again,
01:33:41.660 | the hypothesis space is really, really large.
01:33:47.700 | And when you go back in time, like you were saying,
01:33:50.300 | I want Alexa to remember more things,
01:33:53.180 | once you go beyond a session of interaction,
01:33:56.460 | and by session I mean a time span, such as today,
01:34:00.740 | to remembering which restaurant I like,
01:34:03.300 | and then, when I'm planning a night out, saying,
01:34:05.620 | do you wanna go to the same restaurant?
01:34:07.660 | Now you've upped the stakes big time.
01:34:09.900 | And this is where the reasoning dimension
01:34:12.980 | also goes way, way bigger.
01:34:14.940 | - So you think the space,
01:34:16.980 | we'll be elaborating on that a little bit.
01:34:19.340 | Just philosophically speaking, do you think,
01:34:22.220 | when you reason about trying to model
01:34:24.740 | what the goal of a person is in the context
01:34:28.300 | of interacting with Alexa, you think that space is huge?
01:34:31.340 | - It's huge, absolutely huge.
01:34:33.100 | - Do you think, so like another sort of devil's advocate
01:34:36.100 | would be that we human beings are really simple
01:34:38.780 | and we all want just a small set of things.
01:34:41.500 | So do you think it's possible?
01:34:44.740 | 'Cause we're not talking about
01:34:47.100 | a fulfilling general conversation.
01:34:49.340 | Perhaps actually the Alexa Prize
01:34:51.020 | is a little bit more about that.
01:34:53.420 | Creating a customer experience, so many
01:34:56.140 | of the interactions, it feels like, are clustered
01:35:01.140 | in groups that don't require general reasoning.
01:35:06.140 | - I think, yeah, you're right in terms of the head
01:35:09.420 | of the distribution of all the possible things
01:35:11.900 | customers may wanna accomplish.
01:35:13.820 | The tail is long and it's diverse.
01:35:17.220 | So from that-- - There's many long tails.
01:35:20.540 | - There are many, so from that perspective,
01:35:23.580 | I think you have to solve that problem.
01:35:25.940 | Otherwise, and everyone's very different.
01:35:28.860 | Like, I mean, we see this already
01:35:30.500 | in terms of the skills, right?
01:35:32.420 | I mean, if you're an average surfer, which I am not,
01:35:36.060 | but somebody is asking Alexa about surfing conditions,
01:35:41.780 | and there's a skill that is there for them to get to, right?
01:35:45.580 | That tells you that the tail is massive,
01:35:47.940 | like in terms of what kind of skills
01:35:50.820 | people have created, it's humongous.
01:35:54.300 | And which means there are these diverse needs.
01:35:57.060 | And when you start looking at the combinations of these,
01:36:01.020 | even if you had pairs of skills and 90,000 choose two,
01:36:05.500 | it's still a big set of combinations.
01:36:08.020 | So I'm saying there's a huge to-do here now.
01:36:11.780 | And I think customers are wonderfully frustrated
01:36:16.780 | with things, and we have to keep getting
01:36:19.860 | to do better things for them.
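For scale, the pairwise figure mentioned above works out like this (taking the roughly 90,000 skills at face value):

```python
from math import comb

skills = 90_000
print(comb(skills, 2))  # 4,049,955,000 possible skill pairs, roughly four billion
```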
01:36:20.940 | - So they're not known to be super patient.
01:36:24.020 | So you have to-- - Do it fast.
01:36:25.620 | - You have to do it fast.
01:36:27.020 | So you've mentioned the idea of a press release,
01:36:29.900 | in research and development at Amazon, Alexa,
01:36:33.980 | and Amazon in general: you kind of think
01:36:35.780 | of what the future product will look like,
01:36:37.300 | and then you work backwards to make it happen.
01:36:40.060 | So can you draft for me, you probably already have one,
01:36:43.900 | but can you make up one for 10, 20, 30, 40 years out
01:36:48.820 | that you see the Alexa team putting out,
01:36:52.740 | just in broad strokes, something that you dream about?
01:36:56.500 | - I think let's start with the five years first.
01:37:00.100 | (laughing)
01:37:00.940 | So, and I'll get to the 40, which is too hard to pick.
01:37:03.700 | - 'Cause I'm pretty sure you have a real five year one.
01:37:05.820 | (laughing)
01:37:08.300 | But yeah, in broad strokes, let's start with five years.
01:37:10.140 | - I think the five-year is where, I mean,
01:37:11.780 | I think in these spaces, it's hard,
01:37:14.620 | especially if you're in the thick of things,
01:37:16.140 | to think beyond the five-year space,
01:37:17.940 | because a lot of things change, right?
01:37:20.300 | I mean, if you asked me five years back,
01:37:22.220 | would Alexa be here?
01:37:24.220 | I wouldn't have known; I think it has surpassed
01:37:26.340 | my imagination of that time, right?
01:37:29.020 | So I think from the next five years perspective,
01:37:33.140 | from a AI perspective, what we're gonna see
01:37:37.100 | is that notion which you said, goal-oriented dialogues
01:37:40.380 | and open domain like Alexa Prize,
01:37:42.420 | I think that bridge is gonna get closed.
01:37:45.220 | They won't be different.
01:37:46.380 | And I'll give you why that's the case.
01:37:48.500 | You mentioned shopping, how do you shop?
01:37:52.300 | Do you shop in one shot?
01:37:55.700 | Sure, your AA batteries, paper towels, yes.
01:38:00.340 | How long does it take for you to buy a camera?
01:38:02.820 | You do a ton of research.
01:38:05.980 | Then you make a decision.
01:38:07.540 | So is that a goal-oriented dialogue
01:38:11.500 | when somebody says, "Alexa, find me a camera?"
01:38:15.540 | Is it simply inquisitiveness?
01:38:17.620 | Right, so even in something that you think of
01:38:20.460 | as shopping, which you said you yourself use a lot,
01:38:24.060 | if you go beyond where it's reorders
01:38:27.420 | or items where you sort of are not brand conscious
01:38:32.420 | and so forth, that was just in shopping.
01:38:35.100 | - Just to comment quickly,
01:38:36.220 | I've never bought anything through Alexa
01:38:38.140 | that I haven't bought before on Amazon on the desktop
01:38:41.260 | after I clicked on a bunch of reviews, that kind of stuff.
01:38:44.860 | So it's repurchase.
01:38:45.900 | - So now you think, even for something that you felt like
01:38:49.500 | is a finite goal, I think the space is huge
01:38:52.700 | because even products, the attributes are many.
01:38:56.380 | Like, and you wanna look at reviews,
01:38:58.340 | some on Amazon, some outside,
01:39:00.100 | some you wanna look at what CNET is saying
01:39:02.060 | or another consumer forum is saying
01:39:05.340 | about even a product, for instance, right?
01:39:07.020 | So that's just shopping where you could argue
01:39:11.780 | the ultimate goal is sort of known.
01:39:14.060 | And we haven't talked about,
01:39:15.420 | Alexa, what's the weather in Cape Cod this weekend?
01:39:18.660 | Right, so why am I asking that weather question, right?
01:39:22.580 | So I think of it as how do you complete goals
01:39:27.580 | with minimum steps for our customers, right?
01:39:30.140 | And when you think of it that way,
01:39:32.460 | the distinction between goal-oriented and conversations
01:39:36.020 | for open domain sake goes away.
01:39:38.660 | I may wanna know what happened
01:39:41.740 | in the presidential debate, right?
01:39:43.580 | And is it, I'm seeking just information
01:39:45.860 | or I'm looking at who's winning the debates, right?
01:39:49.620 | So these are all quite hard problems.
01:39:53.420 | So even the five-year horizon problem,
01:39:55.620 | I'm like, I sure hope we'll solve these.
01:39:59.900 | - And are you optimistic? 'Cause that's a hard problem.
01:40:03.420 | - Which part?
01:40:04.260 | - The reasoning enough to be able to help explore
01:40:09.260 | complex goals that are beyond something simplistic.
01:40:12.300 | That feels like it could be, well, five years is a nice--
01:40:16.580 | - Is a nice bar for it, right?
01:40:18.260 | I think you will, it's a nice ambition.
01:40:21.260 | And do we have press releases for that?
01:40:23.740 | Absolutely, can I tell you what specifically
01:40:25.860 | the roadmap will be?
01:40:26.700 | No, right?
01:40:28.100 | And will we solve all of it in the five-year space?
01:40:31.780 | No, we'll work on this forever, actually.
01:40:35.580 | This is the hardest of the AI problems.
01:40:37.980 | And I don't see that being solved even
01:40:40.780 | on a 40-year horizon, because even if you limit
01:40:44.020 | it to human intelligence,
01:40:45.260 | we know we are quite far from that.
01:40:47.700 | In fact, in every aspect, from our sensing
01:40:50.300 | to neural processing, to how the brain stores information
01:40:55.300 | and how it processes it, we don't yet know
01:40:57.180 | how to represent knowledge, right?
01:40:59.100 | So we are still in those early stages.
01:41:03.020 | So I wanted to start, that's why, at the five-year mark,
01:41:06.460 | because the five-year success would look like
01:41:09.220 | solving these complex goals.
01:41:11.340 | And the 40-year would be where it's just natural
01:41:14.660 | to talk to these in terms of more of these complex goals.
01:41:18.820 | Right now, we've already come to the point
01:41:20.100 | where these transactions you mentioned of asking
01:41:23.500 | for weather or reordering something,
01:41:25.860 | or listening to your favorite tune,
01:41:28.700 | it's natural for you to ask Alexa.
01:41:30.980 | It's now unnatural to pick up your phone, right?
01:41:34.020 | And that I think is the first five-year transformation.
01:41:36.700 | The next five-year transformation would be,
01:41:38.940 | okay, I can plan my weekend with Alexa,
01:41:41.100 | or I can plan my next meal with Alexa,
01:41:43.780 | or my next night out with seamless effort.
01:41:47.940 | - So just to pause and look back
01:41:49.540 | at the big picture of it all,
01:41:51.940 | you're part of a large team that's creating a system
01:41:56.940 | that's in the home that's not human,
01:42:00.900 | that gets to interact with human beings.
01:42:02.780 | So we human beings, we these descendants of apes,
01:42:06.140 | have created an artificial intelligence system
01:42:08.980 | that's able to have conversations.
01:42:10.980 | I mean, that to me,
01:42:13.060 | the two most transformative robots of this century,
01:42:20.060 | I think, will be autonomous vehicles,
01:42:22.060 | but they're a little bit transformative
01:42:24.820 | in a more boring way.
01:42:26.420 | It's like a tool.
01:42:28.180 | I think conversational agents in the home
01:42:32.900 | is like an experience.
01:42:34.700 | How does that make you feel,
01:42:36.180 | that you're at the center of creating that?
01:42:38.340 | Do you sit back in awe sometimes?
01:42:42.980 | What is your feeling about the whole mess of it?
01:42:47.420 | Can you even believe that we're able
01:42:49.060 | to create something like this?
01:42:50.900 | - I think it's a privilege.
01:42:52.540 | I'm so fortunate where I ended up.
01:42:56.700 | And it's been a long journey.
01:43:00.860 | Like I've been in this space for a long time in Cambridge,
01:43:03.860 | and it's so heartwarming to see the kind of adoption
01:43:08.860 | conversational agents are having now.
01:43:11.500 | Five years back, it was almost like,
01:43:14.580 | should I move out of this?
01:43:16.220 | Because we are unable to find the skill or application
01:43:19.900 | that customers would love,
01:43:21.300 | that would not simply be a good to have thing
01:43:24.420 | in research labs.
01:43:26.060 | And it's so fulfilling to see it make a difference
01:43:29.140 | to millions and billions of people worldwide.
01:43:32.180 | The good thing is that it's still very early.
01:43:34.380 | So I have another 20 years of job security
01:43:37.300 | doing what I love.
01:43:38.460 | So I think from that perspective,
01:43:40.540 | I tell every researcher that joins
01:43:44.260 | or every member of my team,
01:43:46.220 | that this is a unique privilege.
01:43:47.620 | Like I think, and we have,
01:43:49.580 | and I would say not just launching Alexa in 2014,
01:43:52.740 | which was the first of its kind.
01:43:54.340 | Along the way,
01:43:55.980 | when we launched the Alexa Skills Kit,
01:43:57.380 | it became about democratizing AI.
01:43:59.660 | Before that, there was no good evidence
01:44:02.420 | of an SDK for speech and language.
01:44:04.940 | Now we are coming to this where you and I
01:44:06.620 | are having this conversation where I'm not saying,
01:44:10.300 | oh, Lex, planning a night out with an AI agent, impossible.
01:44:14.540 | I'm saying it's in the realm of possibility.
01:44:17.100 | And not only possibility, we'll be launching this, right?
01:44:19.460 | So some elements of that,
01:44:21.500 | it will keep getting better.
01:44:23.740 | We know that is a universal truth.
01:44:25.580 | Once you have these kinds of agents out there being used,
01:44:30.140 | they get better for your customers.
01:44:32.020 | And I think that's where,
01:44:33.900 | I think the amount of research topics
01:44:36.540 | we are throwing out at our budding researchers
01:44:39.420 | is just gonna be exponentially hard.
01:44:41.780 | And the great thing is you can now get immense satisfaction
01:44:45.580 | by having customers use it,
01:44:47.220 | not just a paper in NeurIPS or another conference.
01:44:51.100 | - I think everyone, myself included,
01:44:53.100 | is deeply excited about that future.
01:44:54.780 | So I don't think there's a better place to end.
01:44:57.500 | Rohit, thank you so much for talking to us.
01:44:58.340 | - Thank you so much.
01:44:59.180 | - This was fun.
01:45:00.300 | - Thank you, same here.
01:45:02.180 | - Thanks for listening to this conversation
01:45:04.180 | with Rohit Prasad.
01:45:05.700 | And thank you to our presenting sponsor, Cash App.
01:45:08.820 | Download it, use code LEXPODCAST.
01:45:11.540 | You'll get $10 and $10 will go to FIRST,
01:45:14.660 | a STEM education nonprofit
01:45:16.500 | that inspires hundreds of thousands of young minds
01:45:19.700 | to learn and to dream of engineering our future.
01:45:23.260 | If you enjoy this podcast, subscribe on YouTube,
01:45:26.180 | give it five stars on Apple Podcast,
01:45:28.140 | support it on Patreon, or connect with me on Twitter.
01:45:31.660 | And now let me leave you with some words of wisdom
01:45:34.900 | from the great Alan Turing.
01:45:37.420 | "Sometimes it is the people no one can imagine anything of
01:45:41.660 | who do the things no one can imagine."
01:45:44.260 | Thank you for listening and hope to see you next time.
01:45:48.380 | (upbeat music)