Rohit Prasad: Amazon Alexa and Conversational AI | Lex Fridman Podcast #57
Chapters
0:00 Introduction
19:46 How Do the Conversations Evolve
48:15 Is Alexa Listening
52:32 Follow-Up Mode
53:02 Alexa Guard
54:09 History of Alexa
58:11 Speech Recognition
59:51 Scale Deep Learning
67:48 Multi-Domain Natural Language Understanding
70:32 Entity Resolution
72:27 Echo Plug
73:45 Alexa Conversations
79:57 Self Learning
87:39 Challenges
91:59 Transfer Learning
105:33 Words of Wisdom
00:00:00.000 |
The following is a conversation with Rohit Prasad. 00:00:02.960 |
He's the vice president and head scientist of Amazon Alexa 00:00:08.880 |
The Alexa team embodies some of the most challenging, 00:00:19.120 |
at the cutting edge of natural language processing 00:00:24.000 |
and enjoyable experience to millions of people. 00:00:27.440 |
This is where state-of-the-art methods in computer science 00:00:30.800 |
meet the challenges of real-world engineering. 00:00:33.720 |
In many ways, Alexa and the other voice assistants 00:00:39.480 |
to millions of people and an introduction to AI 00:00:43.160 |
for people who have only encountered it in science fiction. 00:00:46.920 |
This is an important and exciting opportunity. 00:00:49.960 |
And so the work that Rohit and the Alexa team are doing 00:00:52.920 |
is an inspiration to me and to many researchers 00:01:13.680 |
If you leave a review on Apple Podcasts especially, 00:01:20.000 |
consider mentioning topics, people, ideas, questions, quotes 00:01:23.600 |
in science, tech, or philosophy that you find interesting. 00:01:41.920 |
all the different ways that people can be A players. 00:01:49.200 |
that raw productivity is the measure of excellence, 00:01:53.400 |
I've worked with people who brought a smile to my face 00:01:57.880 |
Their contribution to the team is immeasurable. 00:02:04.600 |
I'll do one or two minutes after introducing the episode 00:02:20.280 |
I personally use Cash App to send money to friends, 00:02:32.040 |
say $1 worth, no matter what the stock price is. 00:02:35.760 |
Brokerage services are provided by Cash App Investing, 00:02:44.400 |
to support one of my favorite organizations called FIRST, 00:02:47.520 |
best known for their FIRST Robotics and Lego competitions. 00:02:50.880 |
They educate and inspire hundreds of thousands of students 00:02:56.240 |
and have a perfect rating on Charity Navigator, 00:03:03.440 |
When you get Cash App from the App Store, Google Play, 00:03:15.080 |
that I've personally seen inspire girls and boys 00:03:20.880 |
This podcast is also supported by ZipRecruiter. 00:03:27.720 |
and to me is one of the most important elements 00:03:44.160 |
ZipRecruiter is a tool that's already available for you. 00:03:47.240 |
It seeks to make hiring simple, fast, and smart. 00:03:50.720 |
For example, Codable co-founder Gretchen Huebner 00:04:02.160 |
Gretchen found it easier to focus on the best candidates, 00:04:05.080 |
and finally, hiring the perfect person for the role 00:04:16.640 |
for businesses of all sizes by signing up, as I did, 00:04:28.360 |
And now, here's my conversation with Rohit Prasad. 00:04:33.400 |
In the movie "Her," I'm not sure if you've ever seen it, 00:04:37.600 |
human falls in love with the voice of an AI system. 00:04:41.200 |
Let's start at the highest philosophical level 00:04:43.400 |
before we get to deep learning and some of the fun things. 00:04:46.600 |
Do you think this, what the movie "Her" shows, 00:04:55.560 |
but I think what we are seeing is a massive increase 00:05:13.600 |
and some of the functionalities that are shown 00:05:28.920 |
do you think such a close connection is possible 00:05:38.960 |
which are both human-like and in these AI assistants, 00:05:49.400 |
AI assistants can be in multiple places at the same time, 00:05:58.000 |
So you have to respect these superhuman capabilities too. 00:06:11.640 |
what they're great at is computation, memory. 00:06:15.960 |
These are the attributes you have to start respecting. 00:06:19.680 |
versus the other aspect, which is also superhuman, 00:06:28.560 |
- So there's certainly elements where you just mentioned, 00:06:31.800 |
Alexa is everywhere, computationally speaking. 00:06:37.040 |
than just the thing that sits there in the room with you. 00:06:44.560 |
that there's just another little creature there 00:06:51.440 |
of the infrastructure, you're interacting with the device. 00:06:53.880 |
The feeling is, okay, sure, we anthropomorphize things, 00:07:03.640 |
the purity of the interaction with a smart assistant, 00:07:06.680 |
what do you think we look for in that interaction? 00:07:12.240 |
I think will be very much where it does feel like a human, 00:07:25.200 |
and you just want to turn on your lights on and off, 00:07:29.840 |
that's not very much like a human-like interaction. 00:07:35.240 |
Just, it should simply complete that command. 00:07:40.200 |
we have to think about this as not human-human alone. 00:07:51.640 |
- So I told you, it's going to be philosophical in parts. 00:07:55.040 |
What's the difference between human and machine 00:08:04.000 |
versus you and a machine that you also are close with. 00:08:09.000 |
- I think you have to think about the roles the AI plays. 00:08:14.040 |
And it differs from customer to customer, 00:08:17.980 |
Especially I can speak from Alexa's perspective. 00:08:27.480 |
So I think most AIs will have these kinds of attributes, 00:08:34.640 |
I think the boundary depends on exact context 00:08:50.080 |
You know, there's a lot of criticism of that kind of test, 00:08:52.260 |
but what do you think is a good test of intelligence, 00:08:55.800 |
in your view, in the context of the Turing test? 00:09:05.320 |
human intelligence, what it means to define it, 00:09:22.860 |
and those are basically a data collection mechanism. 00:09:30.600 |
And from that perspective, I think there are elements 00:09:34.480 |
we have to talk about how we sense the world, 00:09:43.680 |
But then there's the other aspects of computation 00:09:54.240 |
And the retrieval can be extremely fast and pure, 00:10:02.080 |
I mean, machines can remember that quite well. 00:10:06.880 |
I do subscribe to the fact that to be able to converse, 00:10:13.440 |
based on the world knowledge you've acquired, 00:10:18.340 |
is definitely very much the essence of intelligence. 00:10:22.100 |
But intelligence can go beyond human level intelligence 00:10:26.960 |
based on what machines are getting capable of. 00:10:29.800 |
- So what do you think, maybe stepping outside of Alexa, 00:10:35.760 |
what do you think is a good test of intelligence? 00:10:44.920 |
On the research side, what would impress the heck out of you 00:10:47.960 |
if you saw, you know, what is the test where you said, 00:11:04.360 |
So in some sense, and I think we are quite far from that. 00:11:11.480 |
is that the Alexa's intelligence capability is a great test. 00:11:16.480 |
I think of it as, there are many other proof points, 00:11:20.600 |
like self-driving cars, game playing, like Go or chess. 00:11:28.660 |
Clearly requires a lot of data-driven learning 00:11:31.780 |
and intelligence, but it's not as hard a problem 00:11:39.740 |
to accomplish certain tasks or open domain chat, 00:11:43.980 |
In those settings, the key difference is that 00:11:48.180 |
the end goal is not defined, unlike game playing. 00:11:51.900 |
You also do not know exactly what state you are in 00:11:58.960 |
In certain sense, sometimes you can, if it is a simple goal, 00:12:02.100 |
but if you're, even certain examples like planning a weekend 00:12:05.620 |
or you can imagine how many things change along the way. 00:12:09.900 |
You look for weather, you may change your mind 00:12:17.060 |
and then you decide, no, I want this other event 00:12:20.540 |
So these dimensions of how many different steps 00:12:24.020 |
are possible when you're conversing as a human 00:12:26.380 |
with a machine makes it an extremely daunting problem. 00:12:29.140 |
And I think it is the ultimate test for intelligence. 00:12:42.340 |
natural language is a great test, but I would go beyond, 00:12:46.500 |
I don't want to limit it to natural language 00:12:58.500 |
is definitely one of the best tests of intelligence. 00:13:02.980 |
- So can you briefly speak to the Alexa Prize 00:13:12.660 |
and what have you learned and what's surprising? 00:13:18.460 |
- Absolutely, it's a very exciting competition. 00:13:26.880 |
where we threw the gauntlet to the universities 00:13:42.480 |
talking to someone who you're meeting for the first time, 00:13:57.760 |
We have completed two successful years of the competition. 00:14:01.600 |
The first was won by the University of Washington, 00:14:06.880 |
We have an extremely strong team of 10 cohorts, 00:14:09.640 |
and the third instance of the Alexa Prize is underway now. 00:14:31.560 |
- Just a few quick questions, sorry for the interruption. 00:14:33.900 |
What does failure look like in the 20 minute session? 00:14:46.560 |
but the quality of the conversation too that matters. 00:14:51.480 |
before I answer that question on what failure means, 00:15:00.840 |
So during the judging phases, there are multiple phases, 00:15:18.960 |
all the judging is essentially by the customers of Alexa. 00:15:22.720 |
And there you basically rate on a simple question, 00:15:40.120 |
So did you really break that 20 minute barrier 00:15:42.840 |
is why we have to test it in a more controlled setting 00:15:57.040 |
with real customers versus in the lab to award the prize. 00:16:08.040 |
and two of them have to say this conversation 00:16:20.920 |
How far, so the DARPA challenge in the first year, 00:16:37.720 |
to the extent that we're definitely not close 00:16:40.440 |
to the 20 minute barrier being with coherence 00:16:54.080 |
and what kind of responses these social bots generate 00:16:59.160 |
What's even amazing to see that now there's humor coming in. 00:17:06.600 |
- You're talking about the ultimate science of intelligence. 00:17:25.040 |
not only what we think of natural language abilities, 00:17:30.360 |
and aspects of when to inject an appropriate joke, 00:17:38.400 |
how you come back with something more intelligible 00:17:45.160 |
and we are domain experts, we can speak to it. 00:17:47.480 |
But if you suddenly switch to a topic that I don't know of, 00:17:52.120 |
So you're starting to notice these elements as well. 00:18:05.600 |
and essentially mask some of the understanding defects 00:18:09.840 |
- So some of this, this is not Alexa the product. 00:18:17.800 |
I have a question sort of in this modern era, 00:18:32.280 |
Are people in this context pushing the limits? 00:18:45.960 |
as part of the dialogue to really draw people in? 00:18:54.280 |
I think fun is more part of the engaging part for customers. 00:19:04.360 |
But that apart, the real goal was essentially 00:19:07.200 |
what was happening is with a lot of AI research 00:19:13.560 |
has the risk of not being able to have the same resources 00:19:16.800 |
at disposal that we have, which is lots of data, 00:19:24.640 |
to test these AI advances with real customer benefits. 00:19:28.520 |
So we brought all these three together in the Alexa Prize. 00:19:30.880 |
That's why it's one of my favorite projects in Amazon. 00:19:37.520 |
yes, it has become engaging for our customers as well. 00:19:40.960 |
We're not there in terms of where we want it to be, right? 00:19:59.120 |
- Certain keywords and so on. - It's more than keywords. 00:20:06.960 |
these words can be very contextual, as you can see, 00:20:21.800 |
for the conversation to be more useful for advancing AI 00:20:25.960 |
and not so much of these other issues you attributed, 00:20:32.920 |
- Right, so this is actually a serious opportunity. 00:20:53.240 |
It's also, if you think about the other aspect 00:20:57.960 |
of where the whole industry is moving with AI, 00:21:01.440 |
there's a dearth of talent given the demands. 00:21:04.920 |
So you do want universities to have a clear place 00:21:09.920 |
where they can invent and research and not fall behind 00:21:13.960 |
Imagine all grad students left to industry like us 00:21:22.920 |
So this is a way that if you're so passionate 00:21:34.520 |
- So what do you think it takes to build a system 00:21:54.200 |
and responding to those rather than really reasoning 00:22:15.840 |
that are being mentioned so that the conversation 00:22:19.120 |
is coherent rather than you suddenly just switch 00:22:26.720 |
than understanding the true context of the game. 00:22:28.760 |
Like if you just said, I learned this fun fact 00:22:32.360 |
about Tom Brady rather than really say how he played 00:22:39.360 |
then the conversation is not really that intelligent. 00:22:53.760 |
because a lot of times it's more facts being looked up 00:22:57.440 |
and something that's close enough as an answer, 00:23:02.080 |
So that is where the research needs to go more 00:23:08.400 |
And that's why I feel it's a great way to do it 00:23:13.520 |
working to help these AI advances happen in this case. 00:23:13.520 |
What is the experience for the user that's helping? 00:23:26.520 |
So just to clarify, this isn't, as far as I understand, 00:23:35.360 |
It's not you ordering certain things on Amazon.com 00:23:38.080 |
or checking the weather or playing Spotify, right? 00:23:45.600 |
I don't know, how do people, how do customers think of it? 00:23:54.680 |
And let me tell you how you invoke this skill. 00:24:00.240 |
And then the first time you say, Alexa, let's chat, 00:24:16.240 |
And we have a lot of mechanisms where as the, 00:24:23.680 |
then you send a lot of emails to our customers 00:24:26.720 |
and then they know that the team needs a lot of interactions 00:24:35.880 |
who really want to help these university bots 00:24:43.960 |
And also some adversarial behavior to see whether, 00:24:55.280 |
if we talk about solving the Alexa challenge, 00:25:07.480 |
'Cause if we think of this as a supervised learning problem, 00:25:12.160 |
but if it does, maybe you can comment on that. 00:25:22.560 |
- I think that's part of the research question here. 00:25:29.160 |
which is have a way for universities to build 00:25:35.760 |
Now you're asking in terms of the next phase of questions, 00:25:41.040 |
what does success look like from an optimization function? 00:25:47.120 |
we as researchers are used to having a great corpus 00:25:52.560 |
then sort of tune our algorithms on those, right? 00:26:10.880 |
That is another element that's unique where just now, 00:26:17.240 |
and experience this capability as a customer. 00:26:23.560 |
So they ask you a simple question on a scale of one to five, 00:26:27.480 |
how likely are you to interact with this social bot again? 00:26:33.800 |
and customers can also leave more open-ended feedback. 00:26:44.560 |
that as researchers also, you have to change your mindset 00:26:48.520 |
that this is not a DARPA evaluation or an NSF funded study 00:26:54.920 |
This is where it's real world, you have real data. 00:27:01.520 |
And then the customer, the user can quit the conversation 00:27:07.400 |
That is also a signal for how good you were at that point. 00:27:11.720 |
- So, and then on a scale of one to five, one to three, 00:27:15.000 |
do they say how likely are you, or is it just a binary? 00:27:20.840 |
That's such a beautifully constructed challenge, okay. 00:27:23.480 |
You said the only way to make a smart assistant really smart 00:27:30.000 |
is to give it eyes and let it explore the world. 00:27:32.480 |
I'm not sure it might've been taken out of context, 00:27:40.080 |
'Cause I personally also find that idea super exciting 00:27:43.120 |
from a social robotics, personal robotics perspective. 00:27:46.240 |
- Yeah, a lot of things do get taken out of context. 00:27:48.840 |
This particular one was just a philosophical discussion 00:27:52.040 |
we were having on terms of what does intelligence look like? 00:27:59.200 |
I think just we said, we as humans are empowered 00:28:05.160 |
I do believe that eyes are an important aspect of it 00:28:09.560 |
in terms of, if you think about how we as humans learn, 00:28:13.680 |
it is quite complex, and it's also not unimodal 00:28:23.360 |
No, you learn by experience, you learn by seeing, 00:28:33.240 |
Machines on the contrary are very inefficient 00:28:44.360 |
not just less human, not just with less labeled data, 00:28:51.080 |
and where you can increase the learning rate. 00:28:55.160 |
I don't mean less data in terms of not having 00:29:23.920 |
So if you look at, you mentioned supervised learning, 00:29:28.000 |
from moving to more unsupervised, more weak supervision. 00:29:34.880 |
And I think in that setting, I hope you agree with me 00:29:43.520 |
- So absolutely, and from a machine learning perspective, 00:29:46.720 |
which I hope we get a chance to talk about a few aspects 00:29:57.560 |
has a very minimalistic, beautiful interface, 00:30:11.020 |
And nevertheless, we humans, so I have a Roomba, 00:30:15.720 |
I have all kinds of robots all over everywhere. 00:30:18.280 |
So what do you think the Alexa of the future looks like 00:30:23.280 |
if it begins to shift what its body looks like? 00:30:23.280 |
what do you think of the different devices in the home 00:30:33.800 |
as they start to embody their intelligence more and more? 00:30:38.120 |
Philosophically, a future, what do you think that looks like? 00:30:41.200 |
- I think, let's look at what's happening today. 00:30:43.600 |
You mentioned, I think, other devices, as in Amazon devices, 00:30:48.040 |
Alexa is already integrated in a lot of third-party devices, 00:30:58.960 |
some in appliances that you use in everyday life. 00:31:02.600 |
So I think it's not just the shape Alexa takes 00:31:14.240 |
it's getting in different appliances in homes, 00:31:31.120 |
but I think it's also important to think of it, 00:31:37.200 |
that it is in multiple places at the same time. 00:31:40.280 |
So I think the actual embodiment in some sense, 00:31:46.700 |
I think you have to think of it as not as human-like 00:31:58.820 |
and how there are different ways to delight customers 00:32:03.980 |
And I think I'm a big fan of it not being just human-like, 00:32:08.980 |
it should be human-like in certain situations, 00:32:11.140 |
Alexa Prize Social Bot in terms of conversation 00:32:14.900 |
but there are other scenarios where human-like, 00:32:18.820 |
I think is underselling the abilities of this AI. 00:32:22.080 |
- So if I could trivialize what we're talking about. 00:32:29.420 |
about the interaction with the device that Apple produced, 00:32:33.440 |
there was an extreme focus on controlling the experience 00:32:36.780 |
by making sure there are only these Apple-produced devices. 00:32:54.260 |
The voice is the essential element of the interaction. 00:33:09.920 |
I think in terms of a huge scientific problem, 00:33:17.540 |
And especially if it's primarily voice, 00:33:17.540 |
Now you're seeing just other behaviors of Alexa. 00:33:28.500 |
So I think we are in very early stages of what that means. 00:33:31.380 |
And this will be an important topic for the following years. 00:33:34.780 |
But I do believe that being able to recognize 00:33:40.500 |
is going to be important from an Alexa perspective. 00:33:43.380 |
I'm not speaking for the entire AI community, 00:33:49.460 |
And as we go into more of understanding who did what, 00:33:54.460 |
that identity of the AI is crucial in the coming world. 00:33:58.780 |
- I think from the broad AI community perspective, 00:34:02.900 |
So basically if I close my eyes and listen to the voice, 00:34:06.220 |
what would it take for me to recognize that this is Alexa? 00:34:09.620 |
- Or at least the Alexa that I've come to know 00:34:15.140 |
- Yeah, and the Alexa here in the US is very different 00:34:27.340 |
into a different culture, a different community, 00:34:29.060 |
but if you traveled there, how would you recognize Alexa? 00:34:32.460 |
I think these are super hard questions actually. 00:34:34.820 |
- So there's a team that works on personality. 00:34:40.060 |
of what it means culturally speaking, India, UK, US, 00:34:45.580 |
So the problem that we just stated, which is fascinating, 00:34:48.460 |
how do we make it purely recognizable that it's Alexa? 00:34:52.680 |
Assuming that the qualities of the voice are not sufficient, 00:35:11.620 |
who from both the UX background and human factors 00:35:14.140 |
are looking at these aspects and these exact questions. 00:35:17.500 |
But I will definitely say it's not just how it sounds, 00:35:36.220 |
how terse you are or how lengthy your explanations are 00:35:36.220 |
And you also, you mentioned something crucial 00:35:53.460 |
So you as an individual, how you prefer Alexa to sound 00:36:01.260 |
And the amount of customizability 00:36:03.780 |
you want to give is also a key debate we always have. 00:36:19.740 |
in terms of how you raise your pitch and so forth. 00:36:29.460 |
inside of the Alexa team of how much personalization 00:36:34.380 |
'Cause you're taking a risk if you over personalize 00:36:37.260 |
because you don't, if you create a personality 00:36:42.020 |
for a million people, you can test that better. 00:36:53.500 |
the less you can know that it's a great experience. 00:36:56.340 |
So how much personalization, what's the right balance? 00:36:59.700 |
- I think the right balance depends on the customer. 00:37:02.780 |
So I'll say, I think the more control you give customers, 00:37:09.580 |
And I'll give you some key personalization features. 00:37:13.860 |
I think we have a feature called Remember This, 00:37:15.860 |
which is where you can tell Alexa to remember something. 00:37:26.500 |
- What kind of things would that be used for? 00:37:33.260 |
because it's so hard to go and find and see what it is 00:37:43.100 |
where I'm sometimes just looking at it and it's not handy. 00:37:45.940 |
So those are my own personal choices I've made 00:37:49.940 |
for Alexa to remember something on my behalf. 00:37:56.020 |
about how you provide that to a customer as a control. 00:38:00.020 |
So I think these are the aspects of what you do. 00:38:12.980 |
and this person in your household is person two, 00:38:32.220 |
through explicit control right now through your app 00:38:34.620 |
that your multiple service providers, let's say for music, 00:38:41.300 |
depend on whether you have preferred Spotify 00:38:45.700 |
that the decision is made where to play it from. 00:38:48.300 |
- So what's Alexa's backstory from her perspective? 00:38:52.380 |
I remember just asking as probably a lot of us 00:38:58.460 |
are just the basic questions about love and so on of Alexa, 00:39:03.820 |
Just, it feels like there's a little bit of a back, 00:39:08.580 |
feels like there's a little bit of personality, 00:39:31.200 |
- I think, well, it does tell you if I think you, 00:39:41.520 |
- I think you do, 'cause I think I've tested that. 00:39:51.280 |
- So on terms of the metaphysical, I think it's early. 00:39:55.760 |
Does it have the historic knowledge about herself 00:40:15.800 |
and I bring this back to the Alexa Prize Social Bot one, 00:40:28.480 |
that some academia team may think of these problems 00:40:50.480 |
in terms of a customer perspective, a product. 00:40:54.480 |
If you want to create a product that's useful. 00:40:57.120 |
By dangerous, I mean creating an experience that upsets me. 00:41:11.800 |
but if you look at the human to human relationship, 00:41:15.040 |
some of our deepest relationships have fights, 00:41:29.480 |
- So there's one other common thing that you didn't say, 00:41:32.480 |
but we think of it as paramount for any deep relationship. 00:41:38.640 |
- So I think if you trust every attribute you said, 00:41:46.040 |
But what is sort of non-negotiable in this instance is trust. 00:41:51.040 |
And I think the bar to earn customer trust for AI 00:41:54.440 |
is very high, in some sense, more than a human. 00:41:58.000 |
It's not just about personal information or your data. 00:42:03.000 |
It's also about your actions on a daily basis. 00:42:06.600 |
How trustworthy are you in terms of consistency, 00:42:09.400 |
in terms of how accurate are you in understanding me? 00:42:12.640 |
Like if you're talking to a person on the phone, 00:42:22.560 |
That whole example gets amplified by a factor of 10, 00:42:25.920 |
because when you're a human interacting with an AI, 00:42:33.560 |
and then you get upset, why is it behaving this way? 00:42:42.480 |
So I think we grapple with these hard questions as well, 00:42:45.240 |
but I think the key is actions need to be trustworthy 00:42:49.120 |
from these AIs, not just about data protection, 00:43:04.440 |
but trust is such a high bar with AI systems, 00:43:10.920 |
the bar that's placed on an AI system is unreasonably high. 00:43:14.800 |
- Yeah, that is going to be, I agree with you, 00:43:20.480 |
- It's a challenge, and it also keeps my job. 00:43:42.080 |
So I think that's the trade-off we have to balance, 00:43:54.200 |
in accuracy and mistakes than we hold humans? 00:43:57.000 |
That's going to be a great societal question 00:44:06.200 |
I think a lot of people in the AI think about a lot, 00:44:17.360 |
to any AI system can be used to enrich our lives 00:44:25.800 |
So if basically any product that does anything awesome 00:44:37.080 |
people imagine the worst case possible scenario 00:44:42.240 |
People, it boils down to trust, as you said before. 00:44:47.240 |
of in certain groups of governments and so on, 00:44:50.440 |
depending on the government, depending on who's in power, 00:44:55.400 |
And so here's Alexa in the middle of all of it, 00:44:57.960 |
in the home, trying to do good things for the customers. 00:45:02.320 |
So how do you think about privacy in this context 00:45:08.640 |
- Absolutely, so as you said, trust is the key here. 00:45:16.720 |
It has to be designed from very beginning about that. 00:45:20.200 |
And we believe in two fundamental principles. 00:45:28.880 |
when we built what is now called the smart speaker 00:45:33.320 |
we were quite judicious about making these right trade-offs 00:45:45.240 |
when it has heard you say the wake word, 00:45:51.320 |
we also put a physical mute button on it, 00:45:55.480 |
just so if you didn't want it to be listening, 00:46:11.720 |
we gave the control in the hands of the customers 00:46:14.880 |
any of your individual utterances that is recorded 00:46:19.560 |
And we have kept true to that promise, right? 00:46:24.960 |
a great instance of showing how you have the control. 00:46:33.080 |
So that is now making it even just more control 00:46:44.400 |
So these are the types of decisions we continually make. 00:46:48.040 |
We just recently launched this feature called, 00:46:55.720 |
because you've mentioned supervised learning, right? 00:47:03.760 |
And that also is now a feature where you can, 00:47:13.600 |
that we have to constantly offer with customers. 00:47:17.440 |
- So why do you think it bothers people so much that, 00:47:22.840 |
so everything you just said is really powerful. 00:47:28.360 |
'cause we collect, we have studies here running at MIT 00:47:34.820 |
The ability to delete that data is really empowering 00:47:39.980 |
but the ability to have that control is really powerful. 00:47:56.080 |
And all of a sudden, they'll have advertisements 00:48:29.640 |
So you choose one and it listens only for that 00:48:36.480 |
we have to be very clear that it's just the wake word. 00:48:38.360 |
So you said, why is there this anxiety, if you may? 00:48:45.320 |
And I think it's partly on us to keep educating 00:49:04.000 |
there's always a hunger for information and clarity. 00:49:06.640 |
And we'll constantly look at how best to communicate. 00:49:15.320 |
And I think that's absolutely okay to question. 00:49:21.720 |
because our fundamental philosophy is customer first, 00:49:24.840 |
customer obsession is our leadership principle. 00:49:33.160 |
and all decisions in Amazon are made with that. 00:49:38.000 |
and we have to keep earning the trust of our customers 00:49:44.040 |
is there something showing up based on your conversations? 00:49:49.600 |
a lot of times when those experiences happen, 00:49:51.360 |
you have to also know that, okay, it may be a winter season. 00:49:56.480 |
And it shows up on your amazon.com because it is popular. 00:50:01.480 |
you mentioned that personality or personalization, 00:50:06.360 |
turns out we are not that unique either, right? 00:50:27.200 |
But for my, let me just say from my perspective, 00:50:29.240 |
I hope there's a day when customer can ask Alexa 00:50:33.200 |
to listen all the time, to improve the experience, 00:50:36.680 |
to improve, because I personally don't see the negative 00:50:39.840 |
because if you have the control and if you have the trust, 00:50:43.960 |
there's no reason why you shouldn't be listening 00:50:45.680 |
all the time to the conversations to learn more about you. 00:50:48.340 |
Because ultimately, as long as you have control and trust, 00:50:56.940 |
that the device wants, is going to be useful. 00:51:05.140 |
I think it worries me how sensitive people are 00:51:09.540 |
about their data relative to how empowering it could be 00:51:25.460 |
So I just, it's something I think about sort of a lot, 00:51:29.580 |
obviously Alexa thinks about it a lot as well. 00:51:34.260 |
So have you seen, let me ask it in the form of a question. 00:51:37.180 |
Have you seen an evolution in the way people think about 00:51:42.260 |
their private data in the previous several years? 00:51:46.420 |
So as we as a society get more and more comfortable 00:51:48.740 |
with the data, how do we get more and more comfortable 00:51:51.540 |
with the benefits we get by sharing more data? 00:51:57.780 |
And then I'll wanna go back to the other aspect 00:52:03.860 |
we are getting more comfortable as a society. 00:52:12.940 |
is always gonna be the answer for all, right? 00:52:17.180 |
Going back to your question on what more magical experiences 00:52:22.180 |
can be launched in these kinds of AI settings. 00:52:38.300 |
after you've spoken to it, will open the mics again, 00:52:44.660 |
Like if you're adding lists to your shopping item, 00:52:48.540 |
shopping list or to-do list, you're not done. 00:52:57.140 |
So these are the kinds of things which you can empower. 00:53:04.980 |
I said, it only listens for the wake word, all right? 00:53:07.780 |
But if you have, let's say you're going to say, 00:53:11.220 |
you leave your home and you want Alexa to listen 00:53:19.300 |
So it's like just to keep your peace of mind. 00:53:26.500 |
and then it can be listening for these sound events. 00:53:29.220 |
And when you're home, you come out of that mode, right? 00:53:33.020 |
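To make the Guard mode concrete, here is a minimal sketch of mode-gated sound-event handling, assuming a hypothetical acoustic event classifier upstream; the event names and the API are invented for illustration, not Amazon's implementation.

```python
# An illustrative sketch of a Guard-style mode: sound-event handling is
# only enabled while the home is set to "away", and a detection triggers
# a notification rather than a spoken response. Event names are made up.

GUARD_EVENTS = {"glass_break", "smoke_alarm"}

class Guard:
    def __init__(self):
        self.away = False

    def set_away(self, away: bool):
        self.away = away

    def on_sound_event(self, event: str):
        """Called by an (assumed) acoustic event classifier per detection."""
        if self.away and event in GUARD_EVENTS:
            return f"Notify customer: detected {event}"
        return None  # at home, or not a guard event: do nothing

if __name__ == "__main__":
    guard = Guard()
    print(guard.on_sound_event("glass_break"))  # None: not in away mode
    guard.set_away(True)
    print(guard.on_sound_event("glass_break"))  # notification fires
```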
So this is another one where you again gave controls 00:53:38.060 |
and to enable some experience that is high utility 00:53:42.460 |
and maybe even more delightful in the certain settings 00:53:50.780 |
- So I know we kind of started with a lot of philosophy 00:54:03.020 |
is in the algorithm side, the data side, the technology, 00:54:06.180 |
the deep learning, machine learning and so on. 00:54:18.660 |
how it came to be, how it has grown, where it is today? 00:54:27.020 |
and we have a process called working backwards. 00:54:30.340 |
Alexa, and more specifically the product Echo, 00:54:38.900 |
started with a very simple vision statement, for instance, 00:54:47.180 |
along the way changed into what all it can do, right? 00:54:51.740 |
But the inspiration was the Star Trek computer. 00:54:56.260 |
everything is possible, but when you launch a product, 00:55:01.100 |
And when I joined, the product was already in conception 00:55:05.540 |
and we started working on the far field speech recognition 00:55:10.980 |
By that, we mean that you should be able to speak 00:55:15.260 |
And in those days, that wasn't a common practice. 00:55:18.860 |
And even in the previous research world I was in 00:55:24.620 |
in terms of whether you can converse from a distance. 00:55:28.340 |
And here I'm still talking about the first part 00:55:37.140 |
which means the word Alexa has to be detected 00:55:40.380 |
with a very high accuracy because it is a very common word. 00:55:44.860 |
It has sound units that map with words like "I like you" 00:55:56.140 |
the right mentions of Alexa's address to the device 00:56:06.060 |
- Not only noise, but a lot of conversation in the house. 00:56:10.300 |
you're simply listening for the wake word, Alexa. 00:56:13.180 |
And there's a lot of words being spoken in the house. 00:56:15.780 |
How do you know it's Alexa and directed at Alexa? 00:56:20.780 |
Because I could say, I love my Alexa, I hate my Alexa, 00:56:26.980 |
And in all these three sentences I said Alexa, 00:56:33.780 |
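As a rough illustration of the detection problem just described, here is a toy keyword-spotting loop: a scorer runs over sliding windows of audio features, and the device only wakes when the score clears a deliberately high threshold, since a common word must not cause false wakes. The frame sizes, threshold, and scoring stub are all illustrative assumptions, not the production detector.

```python
import numpy as np

FRAME_MS = 10          # one feature frame per 10 ms of audio
WINDOW_FRAMES = 100    # ~1 s window, roughly the length of the wake word
WAKE_THRESHOLD = 0.95  # high bar to keep false wakes rare

def score_window(features: np.ndarray) -> float:
    """Stand-in for the real neural scorer: returns P(wake word | window)."""
    return float(np.clip(features.mean(), 0.0, 1.0))

def wake_word_loop(feature_stream: np.ndarray):
    """Slide a window over the stream and yield detection times (ms)."""
    for start in range(0, len(feature_stream) - WINDOW_FRAMES, WINDOW_FRAMES // 4):
        window = feature_stream[start:start + WINDOW_FRAMES]
        if score_window(window) >= WAKE_THRESHOLD:
            # Only after this point would audio start streaming to the cloud.
            yield start * FRAME_MS

if __name__ == "__main__":
    stream = np.random.rand(2000)        # fake feature stream
    stream[800:900] = 0.99               # simulate a wake word around 8 s
    print(list(wake_word_loop(stream)))  # detection near 8000 ms
```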
What would be your advice that I should probably 00:56:36.740 |
give to people in the introduction of this conversation 00:56:39.980 |
in terms of them turning off their Alexa device 00:56:43.500 |
if they're listening to this podcast conversation out loud? 00:56:52.300 |
Because we mentioned Alexa like a million times. 00:56:55.180 |
- So it will, we have done a lot of different things 00:56:58.140 |
where we can figure out that there is the device, 00:57:03.140 |
the speech is coming from a human versus over the air. 00:57:10.580 |
think about ads or, so we also launched a technology 00:57:18.820 |
But yes, if this kind of a podcast is happening, 00:57:21.620 |
it's possible your device will wake up a few times. 00:57:25.460 |
but it is definitely something we care very much about. 00:57:37.580 |
versus I like something, I mean, that's a fascinating part. 00:57:49.980 |
not like something where the phone is sitting on the table. 00:57:53.900 |
This is like people have devices 40 feet away, 00:58:02.500 |
The next is, okay, you're speaking to the device. 00:58:05.900 |
Of course, you're going to issue many different requests. 00:58:09.020 |
Some may be simple, some may be extremely hard, 00:58:11.580 |
but it's a large vocabulary speech recognition problem, 00:58:13.780 |
essentially, where the audio is now not coming 00:58:28.860 |
your daughter may be running around with something 00:58:40.180 |
need to be recognized with very high accuracy. 00:58:43.380 |
Now we are still just in the recognition problem. 00:58:45.820 |
We haven't yet come to the understanding one. 00:58:51.180 |
Is this before neural networks began to start 00:58:55.500 |
to seriously prove themselves in the audio space? 00:59:00.540 |
- Yeah, this is around, so I joined in 2013, in April. 00:59:05.540 |
So the early research in neural networks coming back 00:59:11.380 |
in the speech recognition space had started happening, 00:59:17.940 |
on the very first thing we did when I joined the team. 00:59:22.940 |
And remember, it was a very much of a startup environment, 00:59:31.380 |
And we knew we'd have to improve accuracy fast. 00:59:31.380 |
once you have a device like this, if it is successful, 00:59:45.060 |
Like you'll suddenly have large volumes of data 00:59:48.180 |
to learn from to make the customer experience better. 01:00:10.100 |
to be able to train on thousands and thousands of hours of speech. 01:00:10.100 |
the combination of large scale data, deep learning progress, 01:00:34.940 |
to be able to solve the far field speech recognition 01:00:38.540 |
to the extent it could be useful to the customers. 01:00:44.620 |
but we are great at it in terms of the settings 01:00:48.460 |
So, and that was important even in the early stages. 01:00:57.100 |
it seems like the task would be pretty daunting. 01:01:20.820 |
How likely were you to fail in the eyes of everyone else? 01:01:28.860 |
- I'll give you a very interesting anecdote on that. 01:01:37.740 |
My first meeting, and we had hired a few more people, 01:01:42.620 |
Nine out of 10 people thought it couldn't be done. 01:01:42.620 |
like either telephony speech for customer service calls 01:02:09.780 |
But this was the kind of belief you must have. 01:02:11.820 |
And I had experience with far field speech recognition 01:02:14.100 |
and my eyes lit up when I saw a problem like that saying, 01:02:25.540 |
to bring something delightful in the hands of customers. 01:02:28.540 |
- You mentioned the way you kind of think of it: 01:02:32.380 |
have a press release and an FAQ, and you think backwards. 01:02:35.820 |
- Did you have, did the team have the echo in mind 01:02:43.100 |
actually putting a thing in the home that works, 01:02:51.500 |
as I said, the vision was Star Trek computer, right? 01:02:56.940 |
And from there, I can't divulge all the exact specifications 01:03:00.660 |
but one of the first things that was magical on Alexa 01:03:11.180 |
because my taste was still in when I was an undergrad. 01:03:15.580 |
and it was too hard for me to be a music fan with a phone. 01:03:36.100 |
in terms of how far are we from the original vision? 01:03:44.500 |
because every day we go in and thinking like, 01:03:47.180 |
these are the new set of challenges to solve. 01:03:49.020 |
- Yeah, it's a great way to do great engineering 01:03:56.780 |
but it's just a super nice way to have a focus. 01:04:01.340 |
and a lot of my scientists have adopted that. 01:04:10.940 |
but they are all after you've done the research 01:04:13.540 |
or you've proven it, and your PhD dissertation proposal 01:04:21.220 |
is the closest that comes to a press release. 01:04:23.620 |
But that process is now ingrained in our scientists 01:04:29.820 |
- You write the paper first and then make it happen. 01:04:38.460 |
where you have a thesis about here's what I expect. 01:04:48.180 |
- So far field recognition, what was the big leap? 01:05:05.460 |
And what we first did was got a lot of training data 01:05:16.180 |
So how do you collect data in a far field setup? 01:05:31.860 |
what would magical mean in this kind of a setting? 01:05:37.500 |
That's always, since you've never done this before, 01:05:57.460 |
where it is given you have no customers right now? 01:06:11.860 |
And I can just tell you that the combination of the two 01:06:39.460 |
That we felt would be where people will use it, 01:06:48.820 |
If we had launched in November 2014 is when we launched, 01:06:58.020 |
- Yeah, and just having looked at voice-based interactions 01:07:05.940 |
it's a source of huge frustration for people. 01:07:10.260 |
for collecting data on subjects to measure frustration. 01:07:14.540 |
So as a training set for computer vision, for face data, 01:07:18.180 |
so we can get a data set of frustrated people. 01:07:22.220 |
is having them interact with a voice-based system in the car. 01:07:29.420 |
And we talked about how also errors are perceived 01:07:35.340 |
But we are not done with the problems that ended up, 01:08:03.940 |
but for these multiple domains like music, like information, 01:08:10.020 |
other kinds of household productivity, alarms, timers, 01:08:27.900 |
So now you're looking at meaning understanding 01:08:31.860 |
on behalf of customers based on their requests. 01:08:37.900 |
Even if you have gotten the words recognized, 01:08:44.060 |
In those days, there was still a lot of emphasis 01:08:48.900 |
on rule-based systems for writing grammar patterns 01:08:53.860 |
but we had a statistical first approach even then, 01:09:01.300 |
an entity recognizer and an intent classifier, 01:09:08.100 |
In fact, we had to build the deterministic matching 01:09:18.180 |
where we focused on data-driven statistical understanding. 01:09:21.980 |
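The statistical-first pipeline described here, an intent classifier plus an entity recognizer, can be sketched in miniature. The intents, training utterances, and gazetteer below are made up for illustration; a production system would use far richer models and data.

```python
from collections import Counter, defaultdict

TRAIN = [
    ("play some jazz", "PlayMusic"),
    ("play the rolling stones", "PlayMusic"),
    ("set a timer for ten minutes", "SetTimer"),
    ("set an alarm for seven", "SetTimer"),
    ("what is the weather today", "GetWeather"),
    ("will it rain tomorrow", "GetWeather"),
]

ARTISTS = {"rolling stones", "stone temple pilots", "led zeppelin"}

def train(examples):
    """Count word/intent co-occurrences (a naive-Bayes-flavored model)."""
    counts = defaultdict(Counter)
    for text, intent in examples:
        for word in text.split():
            counts[intent][word] += 1
    return counts

def classify(text, counts):
    """Pick the intent whose training words best overlap the utterance."""
    words = text.split()
    return max(counts, key=lambda intent: sum(counts[intent][w] for w in words))

def extract_entities(text):
    """Gazetteer lookup: find known artist mentions in the utterance."""
    return [a for a in ARTISTS if a in text]

if __name__ == "__main__":
    model = train(TRAIN)
    utt = "play stone temple pilots"
    print(classify(utt, model), extract_entities(utt))
    # -> PlayMusic ['stone temple pilots']
```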
- Wins in the end if you have a huge data set. 01:09:26.380 |
And that's why it came back to how do you get the data. 01:09:29.060 |
Before customers, the fact that this is why data 01:09:32.460 |
becomes crucial to get to the point that you have 01:09:42.700 |
we were talking about human-machine dialogue, 01:09:49.180 |
do one thing, one-shot transactions, in a great way. 01:09:52.460 |
There was a lot of debate on how much should Alexa talk back 01:10:15.460 |
Stone Temple Pilots or Rolling Stones, right? 01:10:27.100 |
- UX, like what kind of, yeah, how do you solve that problem? 01:10:40.980 |
to whether it's the Stones or the Stone Temple Pilots 01:10:47.140 |
the job of the algorithm, or is the job of UX 01:10:50.580 |
communicating with the human to help the resolution? 01:10:58.820 |
without any further questioning or UX, right? 01:11:01.260 |
So, but it's absolutely okay, just like as humans, 01:11:16.260 |
with more self-learning with these kinds of feedback signals. 01:11:23.300 |
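A hedged sketch of that resolve-or-ask trade-off: score catalog entries against what was heard, act when one candidate clearly wins, and ask a clarifying question when two are close. The catalog, scorer, and margin are illustrative assumptions.

```python
from difflib import SequenceMatcher

CATALOG = ["The Rolling Stones", "Stone Temple Pilots", "Stone Sour"]
ACT_MARGIN = 0.15  # how far ahead the top candidate must be to just act

def similarity(heard: str, candidate: str) -> float:
    """Cheap string similarity as a stand-in for a learned resolver."""
    return SequenceMatcher(None, heard.lower(), candidate.lower()).ratio()

def resolve(heard: str) -> str:
    scored = sorted(((similarity(heard, c), c) for c in CATALOG), reverse=True)
    (top_score, top), (second_score, second) = scored[0], scored[1]
    if top_score - second_score >= ACT_MARGIN:
        return f"Playing {top}"                # confident: just act
    return f"Did you mean {top} or {second}?"  # ambiguous: ask back

if __name__ == "__main__":
    print(resolve("play the stones"))
    print(resolve("rolling stones"))
```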
of understanding the intent and resolving to an action, 01:11:26.500 |
where action could be play a particular artist 01:11:31.980 |
Again, the bar was high as we were talking about, right? 01:11:35.460 |
So while we launched it in sort of 13 big domains, 01:11:40.340 |
I would say in terms of, or we think of it as 13, 01:11:43.420 |
the big skills we had, like music is a massive one 01:11:57.740 |
- So we think of it as music information, right? 01:12:05.500 |
So when we launched, we didn't have smart home, 01:12:26.380 |
and we also have this Echo plug. 01:12:26.380 |
and we have gone on to make Alexa more and more proactive 01:12:40.340 |
with hunches, like you left your light on. 01:12:45.660 |
So yeah, it will help you out in these settings, right? 01:12:58.420 |
- Information, smart devices, you said music. 01:13:01.180 |
- Yeah, so I don't remember everything we had. 01:13:05.060 |
Like that was, you know, the timers were very popular 01:13:09.540 |
Music also, like you could play song, artist, album, 01:13:19.460 |
So that's, again, this is language understanding. 01:13:24.140 |
So where we want Alexa definitely to be more accurate, 01:13:28.420 |
competent, trustworthy based on how well it does 01:13:33.140 |
But we have evolved in many different dimensions. 01:13:35.300 |
First is what I think of it doing more conversational 01:13:40.980 |
And there at Remars this year, which is our AI conference, 01:13:44.940 |
we launched what is called Alexa Conversations. 01:13:58.900 |
Initially it was like, you know, all these IVR systems, 01:14:02.620 |
you have to fully author if the customer says this, 01:14:14.380 |
that you just provide a sample interaction data 01:14:16.780 |
with your service or an API, let's say your Atom tickets 01:14:19.140 |
that provides a service for buying movie tickets. 01:14:23.420 |
You provide a few examples of how your customers 01:14:27.820 |
And then the dialogue flow is automatically constructed 01:14:29.980 |
using a recurrent neural network trained on that data. 01:14:35.940 |
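The production system trains a recurrent network on the developer's sample dialogues; as a small runnable stand-in for the same idea, the sketch below learns a next-action table from a couple of invented sample interactions for a ticket-buying flow.

```python
from collections import Counter, defaultdict

SAMPLE_DIALOGUES = [
    ["AskMovie", "InformMovie", "AskTime", "InformTime", "ConfirmPurchase"],
    ["AskMovie", "InformMovie", "AskSeats", "InformSeats",
     "AskTime", "InformTime", "ConfirmPurchase"],
]

def learn_flow(dialogues):
    """Count which dialogue act tends to follow each act in the samples."""
    transitions = defaultdict(Counter)
    for dialogue in dialogues:
        for prev, nxt in zip(dialogue, dialogue[1:]):
            transitions[prev][nxt] += 1
    return transitions

def next_action(state, transitions):
    """Predict the most likely next dialogue act given the current one."""
    return transitions[state].most_common(1)[0][0]

if __name__ == "__main__":
    flow = learn_flow(SAMPLE_DIALOGUES)
    state = "AskMovie"
    while state != "ConfirmPurchase":
        state = next_action(state, flow)
        print(state)
```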
We just launched our preview for the developers 01:14:42.140 |
which shows even increased utility for customers, 01:14:53.180 |
the goal is often unclear or unknown to the AI. 01:14:58.180 |
If I say, Alexa, what movies are playing nearby? 01:15:12.860 |
whether the Avengers is still in theater or when is it? 01:15:15.900 |
Maybe it's gone, or maybe it will come, or maybe I missed it. 01:15:18.460 |
So I may watch it on Prime, which happened to me. 01:15:28.460 |
And let's say I now complete the movie ticket purchase. 01:15:47.980 |
So can Alexa now figure it out? Do we have the intelligence 01:15:52.540 |
to infer that this meta goal is really a night out, 01:15:57.580 |
when you've completed the purchase of movie tickets 01:16:00.020 |
from Atom Tickets or Fandango or Piccu or anyone, 01:16:03.260 |
then the next thing is, do you want to get an Uber 01:16:10.820 |
Or do you want to book a restaurant next to it? 01:16:14.420 |
And then not ask the same information over and over again, 01:16:18.980 |
what time, how many people in your party, right? 01:16:23.980 |
So this is where you shift the cognitive burden 01:16:35.540 |
and takes the next best action to complete it. 01:16:42.140 |
But essentially the way we solve this first instance 01:16:45.180 |
and we have a long way to go to make it scale 01:17:03.780 |
whether either you have completed the interaction 01:17:07.740 |
So it will shift context into another experience or skill. 01:17:15.340 |
That's making Alexa, you could say, more conversational 01:17:34.260 |
Intent modeling is predicting what your possible goals are 01:17:39.980 |
and switching that depending on the things you say. 01:17:46.500 |
but it would help a lot if Alexa remembered me, 01:17:53.860 |
- Is it trying to use some memory for the customer? 01:17:58.380 |
- Yeah, it is using a lot of memory within that. 01:18:06.820 |
but within the short-term memory, within the session, 01:18:18.300 |
you need at least four seats at a restaurant, right? 01:18:21.740 |
So these are the kind of contexts it's preserving 01:18:24.300 |
between these skills, but within that session. 01:18:26.820 |
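A minimal sketch of that within-session carryover: slots filled while buying movie tickets are reused so the restaurant skill does not re-ask. The skill and slot names are hypothetical.

```python
class Session:
    """Short-term memory that lives only for the current session."""

    def __init__(self):
        self.slots = {}

    def fill(self, **kwargs):
        self.slots.update(kwargs)

    def get(self, name, prompt):
        """Reuse a remembered slot, or 'ask' the customer for it."""
        if name in self.slots:
            return self.slots[name]
        value = input(prompt)  # in a real system: a spoken prompt
        self.slots[name] = value
        return value

def buy_movie_tickets(session):
    session.fill(party_size=4, time="7pm", location="Cambridge")
    print("Bought 4 tickets for 7pm.")

def book_restaurant(session):
    size = session.get("party_size", "How many people? ")
    where = session.get("location", "Which city? ")
    print(f"Booked a table for {size} near {where}.")

if __name__ == "__main__":
    session = Session()
    buy_movie_tickets(session)
    book_restaurant(session)  # party size and location carry over
```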
But you're asking the right question in terms of 01:18:47.060 |
Like I eat the same, I do everything the same, 01:18:50.340 |
the same thing, wear the same thing, clearly, 01:18:55.540 |
So it's frustrating when Alexa doesn't get what I'm saying 01:19:08.420 |
And doesn't know, I've complained to Spotify about this, 01:19:14.060 |
Stairway to Heaven, I have to correct it every time. 01:19:22.540 |
- You should figure, you should send me your, 01:19:24.940 |
next time it fails, feel free to send it to me, 01:19:29.300 |
- Because Led Zeppelin is one of my favorite bands 01:19:35.500 |
I'll make it public, make everybody retweet it. 01:19:39.060 |
We're gonna fix the Stairway to Heaven problem. 01:19:44.340 |
and do the same things, but I'm sure most people 01:19:48.380 |
Do you see Alexa sort of utilizing that in the future 01:19:56.220 |
We call it, where Alexa is becoming more self-learning. 01:19:59.580 |
So Alexa is now auto-correcting millions and millions 01:20:04.420 |
of utterances in the US without any human supervision 01:20:15.740 |
You either, it played the wrong song and you said, 01:20:20.780 |
Or you say, Alexa, play that, you try it again. 01:20:25.220 |
And that is a signal to Alexa that she may have done 01:20:33.540 |
if there's that failure pattern or that action 01:20:36.740 |
of song A was played when song B was requested. 01:20:43.100 |
because play NPR, you can have N be confused as an M 01:20:47.220 |
and then you, for a certain accent like mine, 01:21:01.660 |
And in that part, it starts auto-correcting. 01:21:12.740 |
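A sketch of that self-learning loop, under an invented session-log format: mine sessions where the customer barged in and immediately rephrased, treat the (failed, rephrased) pair as an implicit label, and adopt a rewrite once the pattern repeats often enough. No human labeling is involved, which is the point being made here.

```python
from collections import Counter

SESSIONS = [
    ["play mpr", "BARGE_IN", "play npr"],
    ["play mpr", "BARGE_IN", "play npr"],
    ["play mpr", "BARGE_IN", "play npr"],
    ["play jazz"],
]

MIN_EVIDENCE = 3  # only rewrite after repeats; the threshold is illustrative

def mine_rewrites(sessions):
    """Collect (failed utterance -> rephrased utterance) pairs from logs."""
    pairs = Counter()
    for s in sessions:
        for i, turn in enumerate(s):
            if turn == "BARGE_IN" and 0 < i < len(s) - 1:
                pairs[(s[i - 1], s[i + 1])] += 1
    return {bad: good for (bad, good), n in pairs.items() if n >= MIN_EVIDENCE}

def auto_correct(utterance, rewrites):
    return rewrites.get(utterance, utterance)

if __name__ == "__main__":
    rewrites = mine_rewrites(SESSIONS)
    print(auto_correct("play mpr", rewrites))  # -> "play npr"
```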
- So one of the things that's for me missing in Alexa, 01:21:17.420 |
I don't know if I'm a representative customer, 01:21:19.780 |
but every time I correct it, it would be nice to know 01:21:33.860 |
- We work a lot with Tesla, we study autopilot and so on. 01:21:40.740 |
they feel like they're always teaching the system. 01:21:45.300 |
I don't know if Alexa customers generally think of it 01:21:57.340 |
and some would be annoyed by Alexa acknowledging that. 01:22:08.140 |
But we believe that again, customers helping Alexa 01:22:20.100 |
There is no human in the loop and no labeling happening. 01:22:35.780 |
is gonna get bigger and bigger in the whole space, 01:22:53.220 |
And we have done a lot of advances in our text to speech 01:23:05.580 |
to the timing, the tonality, the tone, everything. 01:23:10.980 |
there's a lot of controls in each of the places 01:23:24.380 |
And we do a ton of listening tests to make sure. 01:23:27.100 |
But naturalness, how it sounds should be very natural. 01:23:30.740 |
How it understands requests is also very important. 01:23:33.660 |
Like, and in terms of, like, we have 95,000 skills, 01:23:37.140 |
and if we have, imagine that in many of these skills, 01:23:43.340 |
And say, Alexa, ask the tide skill to tell me X. 01:23:49.300 |
Or, now, if you have to remember the skill name, 01:23:52.660 |
that means the discovery and the interaction is unnatural. 01:23:56.340 |
And we are trying to solve that by what we think of as, 01:24:00.620 |
again, this was, you don't have to have the app metaphor here. 01:24:07.140 |
Even though they're, so you're not sort of opening 01:24:17.260 |
independent of the specificity, like a skill name. 01:24:34.580 |
And then you can rank the responses from the skill 01:24:38.020 |
and then choose the best response for the customer. 01:24:58.900 |
So that works, that helps with the naturalness. 01:25:02.700 |
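Name-free invocation, as described, can be caricatured as shortlist-and-rank: candidate skills each return a response with a confidence, and the orchestrator picks the best one. The skills and scores below are invented for illustration.

```python
def weather_skill(utterance):
    conf = 0.9 if "weather" in utterance else 0.1
    return conf, "It's 65 and sunny."

def tide_skill(utterance):
    conf = 0.9 if "tide" in utterance else 0.05
    return conf, "High tide is at 6:12 pm."

def music_skill(utterance):
    conf = 0.8 if "play" in utterance else 0.05
    return conf, "Playing your station."

SKILLS = [weather_skill, tide_skill, music_skill]

def route(utterance):
    """Ask every candidate skill, then return the highest-confidence answer."""
    conf, answer = max(skill(utterance) for skill in SKILLS)
    return answer

if __name__ == "__main__":
    print(route("what is the tide in cape cod"))  # tide skill answers
```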
Like if you said, you can say, Alexa, remember, 01:25:10.860 |
through your calendar that's linked to Alexa. 01:25:13.180 |
You don't want to remember whether it's in my calendar 01:25:23.060 |
independent of how customers create these events, 01:25:29.460 |
And it tells you when you have to go to mom's house. 01:25:37.100 |
Who's tasked with integrating all of that knowledge together 01:25:46.100 |
Or is it an infrastructure that Alexa provides problem? 01:25:51.660 |
I think the large problem in terms of making sure 01:26:29.660 |
or Alexa, ask Domino's to get a particular type of pizza, 01:26:41.980 |
That latter part is definitely our responsibility 01:26:44.540 |
in terms of when the request is not fully specific, 01:26:51.060 |
or a service that can fulfill the customer's request? 01:26:59.900 |
that the goal could be more than that individual request 01:27:11.860 |
so this is, welcome to the world of conversational AI. 01:27:17.740 |
because it's not the academic problem of NLP, 01:27:20.340 |
of natural language processing, understanding, dialogue. 01:27:41.540 |
What are the problems that really need to be solved 01:27:58.460 |
needs to work magically, no question about that. 01:28:06.660 |
and not do that, that is unacceptable as a customer, right? 01:28:10.300 |
So that, you have to get the foundational understanding 01:28:15.020 |
The second aspect, when I said more conversational, 01:28:19.740 |
It is really about figuring out what the latent goal is 01:28:23.940 |
of the customer based on what I have the information now, 01:28:28.100 |
and the history, and what's the next best thing to do. 01:28:30.940 |
So that's a complete reasoning and decision making problem. 01:28:39.620 |
Your environment is super hard in self-driving, 01:28:48.140 |
But if you think about how many decisions Alexa is making 01:29:09.660 |
So any given instance, then it's really a decision 01:29:16.820 |
Alexa has to determine what's the best thing it needs to do. 01:29:22.140 |
about decisions based on the information you have. 01:29:30.700 |
Do you think, and we touched this topic a little bit earlier 01:29:38.500 |
to help improve the quality of the hunch it has, 01:29:53.940 |
- I mean, let me again bring back to what it already does. 01:29:56.740 |
We talked about how, based on you barging in over Alexa, 01:30:08.140 |
The next extension of whether frustration is a signal or not, 01:30:19.140 |
- You can get from voice, but it's very hard. 01:30:20.900 |
Like, I mean, frustration as a signal, historically, 01:30:25.540 |
if you think about emotions of different kinds, 01:30:28.060 |
you know, there's a whole field of affective computing, 01:30:31.020 |
something that MIT has also done a lot of research in, 01:30:35.180 |
And you're now talking about a far field device, 01:30:38.620 |
as in you're talking to a distance, noisy environment. 01:30:43.660 |
it needs to have a good sense for your emotions. 01:30:49.860 |
but you haven't shied away from hard problems. 01:30:58.260 |
deep learning approaches to solving the hardest aspects 01:31:10.300 |
a lot of folks are now starting to work in reasoning, 01:31:13.460 |
trying to see how we can make neural networks reason. 01:31:16.180 |
Do you see that new approaches need to be invented 01:31:22.940 |
- Absolutely, I think there has to be a lot more investment 01:31:35.740 |
or like zero-shot learning, one-shot learning. 01:31:39.300 |
- And the active learning stuff you've talked about 01:31:42.780 |
- So transfer learning is also super critical, 01:31:45.300 |
especially when you're thinking about applying knowledge 01:31:48.220 |
from one task to another or one language to another, right? 01:32:14.180 |
is going to be key for our next wave of the technology. 01:32:24.340 |
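One common form of the transfer learning mentioned here can be sketched as follows: keep a pretrained encoder frozen and fit only a small head on a handful of new-task examples. The encoder below is a random stand-in and the data is synthetic; the point is only that freezing the shared representation is what shrinks the new task's data requirement.

```python
import numpy as np

rng = np.random.default_rng(0)
ENCODER = rng.normal(size=(100, 16))  # pretend: learned on a big source task

def encode(token_ids):
    """Mean-pool the frozen pretrained features for a token id sequence."""
    return ENCODER[token_ids].mean(axis=0)

# Tiny synthetic labeled set for the NEW task.
X = np.stack([encode(rng.integers(0, 100, size=5)) for _ in range(40)])
w_true = rng.normal(size=16)
y = (X @ w_true > 0).astype(float)  # synthetic labels for the demo

# Fit only the head (logistic regression by gradient descent).
w = np.zeros(16)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / len(y)

pred = 1.0 / (1.0 + np.exp(-(X @ w))) > 0.5
print(f"train accuracy with a frozen encoder: {(pred == y).mean():.2f}")
```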
that a lot of it can be done by prediction tasks. 01:32:37.220 |
So that's just, I wanted to sort of point that out. 01:32:39.380 |
- So creating a rich, fulfilling, amazing experience 01:32:46.300 |
because it does awesome things, deep learning is enough. 01:32:50.980 |
- I don't think, no, I wouldn't say deep learning is enough. 01:32:58.340 |
I'm saying there are still a lot of things we can do 01:33:02.100 |
with prediction-based approaches that do not reason. 01:33:05.060 |
Right, I'm not saying that, and we haven't exhausted those. 01:33:14.140 |
of what Alexa needs to do, reasoning has to be solved 01:33:30.060 |
But reasoning, we have very, very early days. 01:33:41.660 |
the hypothesis space is really, really large. 01:33:47.700 |
And when you go back in time, like you were saying, 01:33:53.180 |
that once you go beyond a session of interaction, 01:33:56.460 |
which is by session, I mean a time span, which is today, 01:34:00.740 |
to versus remembering which restaurant I like. 01:34:03.300 |
And then when I'm planning a night out to say, 01:34:28.300 |
of interacting with Alexa, you think that space is huge? 01:34:33.100 |
- Do you think, so like another sort of devil's advocate 01:34:36.100 |
would be that we human beings are really simple 01:34:56.140 |
of the interactions, it feels like are clustered 01:35:01.140 |
in groups that don't require general reasoning. 01:35:06.140 |
- I think, yeah, you're right in terms of the head 01:35:09.420 |
of the distribution of all the possible things 01:35:32.420 |
I mean, if you're an average surfer, which I am not, 01:35:36.060 |
but somebody is asking Alexa about surfing conditions, 01:35:41.780 |
and there's a skill that is there for them to get to, right? 01:35:50.820 |
people have created, it's humongous in terms of it. 01:35:54.300 |
And which means there are these diverse needs. 01:35:57.060 |
And when you start looking at the combinations of these, 01:36:01.020 |
even if you had pairs of skills and 90,000 choose two, 01:36:11.780 |
And I think customers are wonderfully frustrated 01:36:27.020 |
So you've mentioned the idea of a press release, 01:36:37.300 |
and you kind of make it happen, you work backwards. 01:36:40.060 |
So can you draft for me, you probably already have one, 01:36:43.900 |
but can you make up one for 10, 20, 30, 40 years out 01:36:52.740 |
just in broad strokes, something that you dream about? 01:36:56.500 |
- I think let's start with the five years first. 01:37:00.940 |
So, and I'll get to the 40; it's too hard to pick. 01:37:03.700 |
- 'Cause I'm pretty sure you have a real five year one. 01:37:08.300 |
But yeah, in broad strokes, let's start with five years. 01:37:29.020 |
So I think from the next five years perspective, 01:37:37.100 |
is that notion which you said, goal-oriented dialogues 01:38:00.340 |
How long does it take for you to buy a camera? 01:38:11.500 |
when somebody says, "Alexa, find me a camera?" 01:38:17.620 |
Right, so even in the something that you think of it 01:38:20.460 |
as shopping, which you said you yourself use a lot of, 01:38:27.420 |
or items where you sort of are not brand conscious 01:38:38.140 |
that I haven't bought before on Amazon on the desktop 01:38:41.260 |
after I clicked on a bunch of reviews, that kind of stuff. 01:38:45.900 |
So now you think in, even for something that you felt like 01:38:52.700 |
because even products, the attributes are many. 01:39:07.020 |
So that's just shopping where you could argue 01:39:15.420 |
Alexa, what's the weather in Cape Cod this weekend? 01:39:18.660 |
Right, so why am I asking that weather question, right? 01:39:22.580 |
So I think of it as how do you complete goals 01:39:32.460 |
the distinction between goal-oriented and conversations 01:39:45.860 |
or I'm looking at who's winning the debates, right? 01:39:59.900 |
- And you're optimistic 'cause that's a hard problem. 01:40:04.260 |
- The reasoning enough to be able to help explore 01:40:09.260 |
complex goals that are beyond something simplistic. 01:40:12.300 |
That feels like it could be, well, five years is a nice-- 01:40:28.100 |
And will we solve all of it in the five-year space? 01:40:40.780 |
on a 40-year horizon, because even if you limit 01:40:50.300 |
to neural processing, to how brain stores information 01:41:03.020 |
So I wanted to start, that's why at the five-year, 01:41:06.460 |
because the five-year success would look like that 01:41:11.340 |
And the 40-year would be where it's just natural 01:41:14.660 |
to talk to these in terms of more of these complex goals. 01:41:20.100 |
where these transactions you mentioned of asking 01:41:30.980 |
It's now unnatural to pick up your phone, right? 01:41:34.020 |
And that I think is the first five-year transformation. 01:41:51.940 |
you're part of a large team that's creating a system 01:42:02.780 |
So we human beings, we these descendants of apes, 01:42:06.140 |
have created an artificial intelligence system 01:42:13.060 |
the two most transformative robots of this century, 01:42:42.980 |
What is your feeling about the whole mess of it? 01:43:00.860 |
Like I've been in this space for a long time in Cambridge, 01:43:03.860 |
and it's so heartwarming to see the kind of adoption 01:43:16.220 |
Because we are unable to find the skill or application 01:43:21.300 |
that would not simply be a good to have thing 01:43:26.060 |
And it's so fulfilling to see it make a difference 01:43:29.140 |
to millions and billions of people worldwide. 01:43:32.180 |
The good thing is that it's still very early. 01:43:49.580 |
and I would say not just launching Alexa in 2014, 01:44:06.620 |
are having this conversation where I'm not saying, 01:44:10.300 |
oh, Lex, planning a night out with an AI agent, impossible. 01:44:17.100 |
And not only possibility, we'll be launching this, right? 01:44:25.580 |
Once you have these kinds of agents out there being used, 01:44:36.540 |
we are throwing out at our budding researchers 01:44:41.780 |
And the great thing is you can now get immense satisfaction 01:44:47.220 |
not just a paper in NeurIPS or another conference. 01:44:54.780 |
So I don't think there's a better place to end. 01:45:05.700 |
And thank you to our presenting sponsor, Cash App. 01:45:16.500 |
that inspires hundreds of thousands of young minds 01:45:19.700 |
to learn and to dream of engineering our future. 01:45:23.260 |
If you enjoy this podcast, subscribe on YouTube, 01:45:28.140 |
support it on Patreon, or connect with me on Twitter. 01:45:31.660 |
And now let me leave you with some words of wisdom 01:45:37.420 |
"Sometimes it is the people no one can imagine anything of 01:45:44.260 |
Thank you for listening and hope to see you next time.