Nuts and Bolts of Applying Deep Learning (Andrew Ng)
>> So when we were organizing this workshop, my co-organizers initially asked me, 00:00:05.120 |
hey Andrew, at the end of the first day, go give a visionary talk. 00:00:07.680 |
So until several hours ago, my talk was advertised as visionary talk. 00:00:13.920 |
I was preparing for this presentation over the last several days. 00:00:18.640 |
I've tried to think what would be the most useful information to you, and 00:00:23.320 |
what are the things that you could take back to work on Monday and 00:00:26.040 |
do something different at your job next Monday. 00:00:28.480 |
And I thought that we need to set some context: as Peter mentioned, I lead an AI team at Baidu. 00:00:33.480 |
It's a team of about 1,000 people working on vision, speech, NLP, and so on. 00:00:42.680 |
And instead of taking the shiniest pieces of deep learning that I know, 00:00:46.000 |
I want to take the lessons that I saw at Baidu that are common to so 00:00:50.280 |
many different academic areas, as well as applications, 00:00:54.000 |
autonomous cars, augmented reality, advertising, web search, medical diagnosis. 00:00:59.360 |
We'll take what I saw, the common lessons, the simple, 00:01:02.080 |
powerful ideas that I've seen drive a lot of machine learning progress at Baidu. 00:01:06.720 |
And I thought I will share those ideas with you, 00:01:09.200 |
because the patterns I see across a lot of projects, I thought, might be the patterns that apply to 00:01:15.440 |
whatever you are working on in the next several weeks or months. 00:01:19.480 |
So one common theme that will appear in this presentation today is that 00:01:25.360 |
the workflow of organizing machine learning projects feels like 00:01:28.920 |
parts of it are changing in the era of deep learning. 00:01:31.840 |
So for example, one of the ideas I'll talk about is bias-variance. 00:01:36.520 |
And then many of you, maybe all of you, have heard of bias and variance. 00:01:42.760 |
I feel like there have been some changes to the way we think about bias and variance. 00:01:46.280 |
So I wanna talk about some of these ideas, which maybe aren't even deep learning per 00:01:49.920 |
se, but have been slowly shifting as we apply deep learning to more and more problems. 00:01:57.360 |
And instead of holding all your questions until the end, 00:02:01.360 |
if you have a question in the middle, feel free to raise your hand as well. 00:02:03.760 |
I'm very happy to take questions in the middle, 00:02:05.720 |
since this is maybe a more informal whiteboard talk, right? 00:02:13.800 |
>> So one question that I still get asked sometimes is: 00:02:20.840 |
a lot of the basic ideas of deep learning have been around for decades. 00:02:27.200 |
Why is it that deep learning, these neural networks we've all known about for 00:02:30.680 |
maybe decades, why are they working so well now? 00:02:33.800 |
So I think that the one biggest trend in deep learning is scale. 00:02:41.440 |
And I think Andrej mentioned scale of data and scale of computation, and 00:02:45.960 |
I'll just draw a picture that illustrates that concept maybe a little bit more. 00:02:50.840 |
So if I plot a figure, where on the horizontal axis I plot 00:02:56.520 |
the amount of data we have for a problem, and on the vertical axis we plot performance. 00:03:03.000 |
Right, so x-axis is the amount of spam data you've collected, 00:03:06.160 |
y-axis is how accurately can you classify spam. 00:03:10.080 |
Then if you apply traditional learning algorithms, 00:03:13.920 |
right, what we found was that the performance 00:03:18.640 |
often looks like it starts to plateau after a while. 00:03:23.600 |
It was as if the older generations of learning algorithms, 00:03:26.680 |
including logistic regression and SVMs, 00:03:30.600 |
it was as if they didn't know what to do with all the data that we finally had. 00:03:34.000 |
And what happened kind of over the last 20 years, 00:03:36.920 |
last 10 years, was, with the rise of the Internet, rise of mobile, rise of IoT, 00:03:42.320 |
that society sort of marched to the right of this curve, right, for many problems. 00:03:48.840 |
And so with all the buzz and all the hype about deep learning, 00:03:53.280 |
in my opinion, the number one reason that deep learning algorithms work so 00:03:58.120 |
well is that if you train, let me call it a small neural net, you get somewhat better performance; 00:04:11.880 |
if you train a medium-sized neural net, right, maybe you get even better performance. 00:04:18.440 |
And it's only if you train a large neural net that you could train a model with 00:04:24.720 |
the capacity to absorb all this data that we now have access to, 00:04:28.320 |
and that allows you to get the best possible performance. 00:04:30.840 |
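As a rough sketch of the figure being drawn here — the curves below are invented purely for illustration, not measurements:

```python
# Sketch of the whiteboard figure: performance vs. amount of data for a
# traditional algorithm and for small/large neural nets. The saturating
# curves are made up for illustration only.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)  # amount of labeled data (arbitrary units)

def curve(ceiling, rate):
    """A performance curve that rises with data and plateaus at `ceiling`."""
    return ceiling * (1 - np.exp(-rate * x))

plt.plot(x, curve(0.70, 1.2), label="traditional algorithm (e.g. SVM)")
plt.plot(x, curve(0.80, 0.9), label="small neural net")
plt.plot(x, curve(0.95, 0.5), label="large neural net")
plt.xlabel("amount of data")
plt.ylabel("performance")
plt.legend()
plt.show()
```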
And so I feel like this is a trend that we're seeing in many verticals in the industry. 00:04:36.880 |
One is that this, actually when I draw this picture, some people ask me, well, 00:04:42.400 |
does this mean a small neural net always dominates a traditional learning algorithm? 00:04:47.720 |
Technically, if you look at the small data regime, 00:04:50.520 |
if you look at the left end of this plot, right, 00:04:54.440 |
the relative ordering of these algorithms is not that well defined. 00:04:57.480 |
It depends on who's more motivated to engineer the features better, right? 00:05:00.840 |
If the SVM guy is more motivated to spend more time engineering features, 00:05:05.520 |
they might beat out the neural network application. 00:05:12.800 |
Because in that small data regime, a lot of the knowledge of the algorithm comes from hand engineering, right? 00:05:16.120 |
But this trend is much more evident in the regime of big data, where you just can't 00:05:21.760 |
hand-engineer enough knowledge in, and the large neural net combined with a lot of data tends to outperform. 00:05:29.400 |
The implication of this figure is that in order to get the best performance, 00:05:33.200 |
in order to hit that target, you need two things, right? 00:05:36.240 |
One is you need to train a very large neural network, 00:05:39.280 |
or reasonably large neural network, and you need a large amount of data. 00:05:45.040 |
And so this in turn has caused pressure to train large neural nets, right? 00:05:51.520 |
Build large nets as well as get huge amounts of data. 00:05:54.800 |
So one of the other interesting trends I've seen is that 00:05:58.120 |
increasingly I'm finding that it makes sense to build an AI team as well as 00:06:05.320 |
build a computer systems team and have the two teams kind of sit next to each other. 00:06:09.320 |
And the reason I say that is, I guess, so let's see. 00:06:13.320 |
So when we started Baidu Research, we set up our team that way. 00:06:18.280 |
I think Peter mentioned to me that OpenAI also has a systems team and an AI team. 00:06:23.720 |
And the reason we're starting to organize our teams that way, I think, 00:06:26.120 |
is that some of the computer systems work we do, right? 00:06:29.480 |
So we have an HPC team, a high-performance computing team. 00:06:34.160 |
Some of the extremely specialized knowledge in HPC is just incredibly difficult for 00:06:41.720 |
any one person to master. Maybe Jeff Dean is smart enough to learn everything. 00:06:44.080 |
But it's just difficult for any one human to be sufficiently expert in both HPC and machine learning at the same time. 00:06:52.760 |
And so we've been finding — and Shubho, actually, 00:06:56.160 |
one of the co-organizers, is on our HPC team — we've been finding that bringing 00:07:01.160 |
together knowledge from these multiple sources, multiple communities, works really well. 00:07:07.560 |
You've heard a lot of fantastic presentations today. 00:07:12.320 |
I want to draw one other picture, which is, in my mind, 00:07:16.160 |
this is how I mentally bucket work in deep learning. 00:07:20.320 |
So this might be a useful categorization, right, where 00:07:23.520 |
you can mentally put each talk into one of these buckets I'm about to draw. 00:07:27.640 |
But I feel like there's a lot of work on what I'm gonna call, you know, 00:07:33.160 |
general deep learning models. And this is basically the type of model that Hugo Larochelle talked about 00:07:36.640 |
this morning, where you have, you know, really densely connected layers, right? 00:07:42.240 |
I guess FC, fully connected, right — so there's a huge bucket of models there. 00:07:47.800 |
And then I think a second bucket is sequence models. 00:07:51.400 |
So 1D sequences, and this is where I would bucket a lot of the work on RNNs, 00:08:00.800 |
you know, LSTMs, right, GRUs, some of the attention models, 00:08:05.760 |
which I guess probably Yoshua Bengio will talk about tomorrow, or maybe others. 00:08:11.480 |
But so the 1D sequence models are another huge bucket. 00:08:17.720 |
And then a third bucket is image models — this is really 2D and maybe sometimes 3D — but this is where I would tend to 00:08:22.080 |
bucket all the work on CNNs, convolutional nets. 00:08:26.480 |
And then in my mental model, there's a fourth bucket, which is the other, right? 00:08:31.320 |
And this includes unsupervised learning, you know, 00:08:34.760 |
reinforcement learning, right, as well as lots of other creative ideas. 00:08:41.880 |
You know, I still find slow feature analysis, 00:08:45.040 |
sparse coding, ICA — various models kind of in the other category — super exciting. 00:08:52.080 |
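As a minimal sketch of the first three buckets in code (PyTorch here; all the layer sizes are placeholder assumptions, not anything from the talk):

```python
# Minimal placeholder models for three of the buckets. Shapes are arbitrary.
import torch.nn as nn

# Bucket 1: general deep learning -- densely/fully connected (FC) layers.
fc_model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10))

# Bucket 2: 1D sequence models -- RNNs/LSTMs/GRUs.
seq_model = nn.LSTM(input_size=40, hidden_size=128, num_layers=2, batch_first=True)

# Bucket 3: 2D/3D models -- convolutional nets for images.
conv_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                            # 32x32 input -> 16x16 feature maps
    nn.Flatten(), nn.Linear(16 * 16 * 16, 10),  # assumes 32x32 RGB input
)
```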
So it turns out that if you look across industry today, 00:08:55.600 |
almost all the value today is driven by these three buckets, right? 00:09:03.040 |
So what I mean is those three buckets of algorithms are driving, 00:09:08.280 |
causing us to have much better products, right, or monetizing very well. 00:09:12.280 |
It's just incredibly useful for lots of things. 00:09:15.480 |
In some ways, I think this bucket might be the future of AI, right? 00:09:18.640 |
So I find unsupervised learning especially super exciting. 00:09:21.600 |
So I'm actually super excited about this as well. 00:09:25.240 |
Although I think that if on Monday you have a job and you're trying to build 00:09:29.520 |
a product or whatever, the chance of you using something from one of these three 00:09:35.040 |
buckets is much higher. But I definitely encourage you to contribute to research in that fourth bucket as well, right? 00:09:39.320 |
So I said that major trend one of deep learning is scale. 00:09:47.840 |
This is what I would say is maybe major trend two — of two trends. 00:09:55.080 |
I feel major trend two is the rise of end-to-end deep learning, 00:10:07.480 |
I'll say a little bit more in a second exactly what I mean by that. 00:10:10.120 |
But the examples I'm gonna talk about are all from one of these three buckets, 00:10:13.040 |
right, general DL, sequence models, image 2D, 3D models. 00:10:16.960 |
But let's see, it's best illustrated with a few examples. 00:10:21.400 |
Until recently, a lot of machine learning used to output just real numbers. 00:10:26.240 |
So I guess in Richard's example, you have a movie review, right? 00:10:32.000 |
And then, actually, I prepared totally different examples. 00:10:34.760 |
I was editing my examples earlier to be more coherent with the speakers before me. 00:10:39.840 |
We have a movie review and then output the sentiment. 00:10:42.920 |
Is this a positive or a negative movie review? 00:10:48.400 |
Or, say, you want to do ImageNet object recognition: input an image, output what object it is. 00:10:57.680 |
So a lot of machine learning was about outputting a single number. 00:11:04.440 |
And the number two major trend that I'm really excited about is, um, 00:11:08.440 |
end-to-end deep learning: algorithms that can output much more complex things than numbers. 00:11:16.120 |
So for example, image captioning, where instead of taking an image and saying this is a cat, 00:11:20.200 |
you can now take an image and output, you know, 00:11:22.880 |
an entire string of text, using an RNN to generate that sequence. 00:11:34.520 |
A whole bunch of people have worked on this problem. 00:11:37.160 |
Um, one of the things that I guess, uh, my 00:11:41.080 |
collaborator Adam Coates will talk about tomorrow, uh, 00:11:45.680 |
is, um, speech recognition, where you take as input audio and you directly output the transcript. 00:11:54.640 |
And so, um, when we first proposed using this kind of 00:11:58.320 |
end-to-end architecture to do speech recognition, this was very controversial — 00:12:04.360 |
the idea of actually putting this in a production speech system was very controversial. 00:12:11.000 |
But I think the whole community has come around to this point of view more recently. 00:12:33.600 |
And you saw some examples of image synthesis as well. 00:12:40.440 |
So a lot of the progress of deep learning that I find very exciting — I mean, 00:12:45.480 |
transformative things that we just couldn't build three or four years ago — 00:12:48.440 |
has been in this trend toward learning algorithms that output not just a number, 00:12:53.600 |
but can output very complicated things, like a sentence or a caption, 00:13:00.000 |
or, like the recent WaveNet paper, audio, right? 00:13:17.920 |
So I think that end-to-end deep learning is a major trend, you know. 00:13:22.360 |
Um, I wanna give you some rules of thumb: 00:13:25.480 |
what exactly is end-to-end deep learning, and when to use it and when not to use it. 00:13:28.360 |
So I'm moving to the second bullet, and we'll go through these. 00:13:32.760 |
So the trend towards end-to-end deep learning has been, um, 00:13:40.760 |
this idea that instead of engineering a lot of intermediate representations, 00:13:45.840 |
maybe you can go directly from your raw input to whatever you wanna predict, right? 00:13:51.520 |
So for example, take speech recognition. 00:13:58.360 |
Um, previously one used to go from the audio to, you know, 00:14:03.920 |
hand-engineered features like MFCCs or something, and then maybe extract phonemes, 00:14:09.120 |
right? Um, and then eventually you try to generate the transcript. 00:14:14.720 |
Um, for those of you that aren't sure what a phoneme is: 00:14:27.280 |
phonemes are the basic units of sound, such as 'k' — 00:14:31.040 |
units hypothesized by linguists to be the basic building blocks of speech. 00:14:37.120 |
'k', 'a', and 't' are maybe the three phonemes that make up the word cat, right? 00:14:43.640 |
And I think in 2011 Li Deng and Geoff Hinton, um, 00:14:46.640 |
made a lot of progress in speech recognition by saying we can use deep learning to do this first step. 00:14:51.880 |
Um, but the end-to-end approach to this would be to say, 00:15:00.720 |
right, input the audio and output the transcript. 00:15:13.080 |
So the phrase end-to-end deep learning refers to, uh, 00:15:17.400 |
having a learning algorithm directly go from input to output. 00:15:27.680 |
And it's actually very simple, but it only works sometimes. 00:15:32.720 |
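As a sketch of the contrast between the two architectures — every helper below is a hypothetical stub, not a real API:

```python
# Pipeline vs. end-to-end speech recognition, as hypothetical Python stubs.
compute_mfcc = lambda audio: audio             # stub: hand-engineered features (MFCCs)
phoneme_model = lambda feats: ["k", "a", "t"]  # stub: learned features -> phonemes
decode_words = lambda phonemes: "cat"          # stub: lexicon/language-model decoding
speech_net = lambda audio: "cat"               # stub: one net, raw audio -> transcript

def pipeline_recognizer(audio):
    """Traditional pipeline: audio -> MFCCs -> phonemes -> transcript."""
    features = compute_mfcc(audio)
    phonemes = phoneme_model(features)
    return decode_words(phonemes)

def end_to_end_recognizer(audio):
    """End-to-end: one model trained to map raw audio directly to text."""
    return speech_net(audio)
```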
yeah, I'll just tell you this interesting story. 00:15:34.800 |
You know, this end-to-end story really upset a lot of people. 00:15:47.200 |
And I still remember there was a meeting at Stanford, 00:15:50.480 |
there was a linguist kind of yelling at me in public, uh, for saying that. 00:15:55.720 |
uh, but we turned out to be right, you know, so. 00:16:08.160 |
The Achilles heel of a lot of deep learning is that you need tons of labeled data, right? 00:16:26.600 |
So depending on how much labeled data you can get, one may or may not consider end-to-end deep learning. 00:16:28.640 |
Um, this is a problem I learned about just last week from Curtis Langlotz: looking at 00:16:37.880 |
x-ray pictures of a child's hand in order to predict the child's age, right? 00:16:42.240 |
Doctors actually care to look at an x-ray of 00:16:45.120 |
a child's hand in order to predict the age of the child. 00:16:48.120 |
So, um, boy, let me draw an x-ray image of 00:16:53.160 |
the child's hand. So these are the bones, right? 00:17:00.000 |
Okay. So that's a hand, and you see the bones. 00:17:03.240 |
Um, and so a more traditional algorithm might input an image and segment out the bones — 00:17:16.320 |
there's a bone here — and then maybe measure the lengths of these bones, right? 00:17:30.040 |
And then use some simple formula to go from the bone lengths to estimate the age of the child, right? 00:17:34.640 |
So this is a non-end-to-end approach to try and solve this problem. 00:17:37.960 |
Um, an end-to-end approach would be to take an image and directly output the child's age. 00:17:45.800 |
And I think this is one example of a problem where, um, 00:17:49.560 |
it's very challenging to get end-to-end deep learning to work, because there isn't enough data. 00:17:55.000 |
You just don't have enough x-rays of children's hands annotated with their ages. 00:17:59.560 |
And instead, where we see deep learning coming in is in this first step, the segmentation, right? 00:18:09.840 |
But the advantage of this non-end-to-end architecture is it 00:18:13.600 |
allows you to hand-engineer in more information about the system — 00:18:19.840 |
like the relation between bone lengths and age, which you can kind of get tables about. 00:18:24.400 |
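Here's one way the non-end-to-end pipeline might look in code — the segmentation network, the length measurement, and the growth-table lookup are all hypothetical stand-ins:

```python
# Sketch of the non-end-to-end bone-age pipeline. All three helpers are stubs.
segment_bones = lambda xray: [[0, 40], [0, 33]]   # stub: deep learning lives here
measure_lengths = lambda mask: [40.0, 33.0]       # stub: classical image processing
age_from_table = lambda lengths: 8.5              # stub: hand-engineered growth table

def estimate_bone_age(xray_image):
    bone_mask = segment_bones(xray_image)    # learned from modest labeled data
    lengths = measure_lengths(bone_mask)
    # The final step encodes hand-engineered knowledge (e.g. published growth
    # statistics), so it needs no end-to-end training data at all.
    return age_from_table(lengths)
```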
And I think one of the unfortunate things about deep learning research is that, um, let's see — 00:18:29.880 |
uh, you know, for suitably sexy values of X and Y, 00:18:35.640 |
you could almost always train a model to map X to Y and publish a paper. 00:19:07.920 |
Um, who was it just now who said you'd better do it quickly? 00:19:12.360 |
Yeah. Um, let me give a couple of other examples 00:19:15.600 |
where it might be harder to backprop all the way through, right? 00:19:21.440 |
Take self-driving cars. You know, most teams are using an architecture where you input an image — 00:19:25.160 |
you know, what's in front of the car, let's say. 00:19:30.160 |
Uh, and then maybe use the image to detect pedestrians, right? 00:19:35.440 |
Self-driving cars are obviously more complex than this, right? 00:19:38.000 |
Uh, but then now that you know where the other cars and the pedestrians are, you use that 00:19:45.400 |
to then, you know, come up with a trajectory, right — 00:19:53.100 |
what's the trajectory that you want your car to drive through? Um, 00:20:08.240 |
this is actually the architecture that most self-driving car teams are using. 00:20:13.800 |
There have been interesting approaches to say, 00:20:16.880 |
well, I'm gonna input an image and output a steering direction, right? 00:20:27.760 |
I'd be very cautious about this second approach, because — and I think 00:20:33.120 |
the second approach will work; you can even prove a theorem, 00:20:35.880 |
you know, showing that it will work, I think — we just don't have 00:20:40.280 |
enough data to make the second approach really, really work well, right? 00:20:44.040 |
And I think kind of the — Peter made a great comment just now. 00:20:46.720 |
And I think, you know, some of these components will be incredibly complicated — 00:20:50.800 |
you know, like this trajectory step could be the output of an explicit search or planner. 00:20:53.640 |
And you could actually design a really complicated planner to generate the trajectory. 00:20:58.140 |
And your ability to hand-code that still has a lot of value, right? 00:21:10.620 |
Um, but unless you actually have the data, you know, 00:21:13.620 |
some of these things make for great demos if- if you cherry pick the examples. 00:21:17.440 |
But- but it can be challenging to, um, get to work at scale. 00:21:24.980 |
I don't think it's a- I don't think this will necessarily fail. 00:21:27.900 |
I just think the data needed to do this will be- will be really immense. 00:21:31.620 |
So I'd be very cautious about it right now. 00:21:40.380 |
one of the themes that comes up in machine learning — 00:21:44.140 |
or really, if you're working on a machine learning project — is: you build a model, 00:21:54.420 |
train it, maybe it doesn't work as well as you're hoping yet. 00:21:57.360 |
And the question is, what do you do next, right? 00:21:59.860 |
This is a very common part of a machine learning, you know, 00:22:02.620 |
researcher's or a machine learning engineer's life, 00:22:04.620 |
which is, you know, you train a model and have to decide what to try next — 00:22:15.700 |
maybe more data, maybe training longer, maybe a different neural network architecture, maybe regularization. 00:22:24.980 |
a lot of the skill of a machine learning researcher or 00:22:27.320 |
a machine learning engineer is knowing how to make these decisions, right? 00:22:30.640 |
And the difference in performance — whether, you know, 00:22:33.720 |
you train a bigger model or you try regularization — 00:22:36.960 |
your skill at making these decisions will have a huge impact on how 00:22:40.160 |
rapidly, um, you can make progress on an actual machine learning problem. 00:22:44.800 |
So, um, I want to talk a little about bias and variance, 00:22:52.400 |
and I feel like the way we use these ideas is evolving slightly in the era of deep learning. 00:22:59.260 |
Let's say the goal is to build a human-level speech recognition system. 00:23:15.880 |
The traditional approach, especially in academia, is we'll get a dataset, you know, 00:23:20.880 |
and then we shuffle it and we randomly split it into a 70/30 train/test split — or train, dev, and test — 00:23:35.840 |
but I'm just gonna use the term dev set for the data you tune on. 00:23:42.400 |
Um, and so what I encourage you to do, if you aren't already, is to measure the following error rates. 00:23:55.320 |
So let — actually, let me illustrate with an example. Let's say, 00:24:03.900 |
you know, human-level error is, uh, 1% error. 00:24:07.560 |
Um, and let's say that your training set error is 5% and your dev set error is 6% — where, 00:24:20.040 |
really, right, the dev set is a proxy for the test set, except you tune to the dev set, right? 00:24:28.760 |
Okay. So this is really a basic 00:24:33.920 |
step in developing a learning algorithm that I encourage you to do if you aren't already, because it 00:24:42.080 |
really helps in terms of telling you what to do next. 00:24:46.880 |
Here, you see that you're doing much worse than human-level performance. 00:24:49.640 |
Um, and so you see that there's a huge gap here from 1% to 5%. 00:25:00.920 |
I'm using the terms bias and variance informally and 00:25:03.280 |
doesn't correspond exactly to the way they're defined in textbooks. 00:25:06.200 |
But I find these useful concepts for- for- for deciding how to make progress on your problem. 00:25:14.120 |
So in this example, you have a high-bias classifier. 00:25:35.460 |
Now, if instead your training error were 2% and your dev set error 6%, then, you know, you really have a high 00:25:38.180 |
variance problem, right, like an overfitting problem. 00:25:41.680 |
And this tells you — this really tells you what to do, what to try, right? 00:25:54.520 |
Um, and then there's also really a third case, which is if you have, 00:26:07.360 |
say, 5% training error and a much higher dev set error. 00:26:16.640 |
And in this case, you have high bias and high variance, right? 00:26:23.960 |
High bias and high variance — you know, that sucks for you, right? 00:26:26.840 |
Um, so I feel like, when I talk to applied machine learning teams, this simple analysis really helps drive 00:26:41.200 |
decisions about what you should be doing on your machine learning application. 00:26:45.400 |
Um, and if you're wondering why I'm talking about this and what this has to do with deep learning — I'll get to that. 00:26:53.280 |
But, uh, I feel like there's this, you know, 00:26:55.560 |
almost a workflow, like almost a flow chart, right? 00:27:06.480 |
Oh, and I hope I'm writing big enough that people can see it — if not, 00:27:11.600 |
let me know and I'll read it back out, right? 00:27:15.600 |
So the first question is: are you even doing well on your training set? 00:27:17.560 |
Um, and if your training error is high, that's a bias problem. 00:27:23.040 |
And so your standard tactics are: train a bigger model, train longer, 00:27:31.640 |
make sure that your optimization algorithm is doing a good enough job. 00:27:35.280 |
Um, and then there's also this magical one, which is try a new model architecture. 00:27:43.880 |
And then you kind of keep doing that until you're doing well at least on your training set. 00:27:49.520 |
Once you're at least doing well on your training set — 00:27:53.480 |
so no, training error is not unacceptably high — the next question is: is your dev set error high? 00:28:14.120 |
If it is, that's a variance problem, and so, you know, the solutions are: try to get more data, 00:28:33.280 |
add regularization, or again a new model architecture. And you kind of keep doing this until your, 00:28:38.400 |
uh, dev set error is low — 00:28:42.520 |
I guess until you're doing well on both your training set and your dev set. 00:28:45.960 |
And then, you know, hopefully, right, you're done. 00:28:52.760 |
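As a minimal sketch of this flow chart in code — the 1% gap threshold and the suggested fixes are illustrative, not a fixed recipe:

```python
# The basic bias/variance recipe, with the talk's example numbers (in %).
def next_step(human_err, train_err, dev_err):
    if train_err - human_err > 1.0:    # not yet doing well on the training set
        return "high bias: bigger model, train longer, new architecture"
    if dev_err - train_err > 1.0:      # train is fine, dev is much worse
        return "high variance: more data, regularization, new architecture"
    return "doing well on train and dev: hopefully done"

print(next_step(human_err=1.0, train_err=5.0, dev_err=6.0))  # -> high bias
print(next_step(human_err=1.0, train_err=2.0, dev_err=6.0))  # -> high variance
```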
One of the nice things about this era of deep learning is that, kind of, 00:28:57.840 |
no matter where you're stuck, with modern deep learning tools 00:29:01.540 |
we have a clear path for making progress, in a way that was not true, 00:29:05.300 |
or at least was much less true, in the era before deep learning. Whether you have 00:29:13.100 |
really high bias or high variance or maybe both, right, 00:29:15.620 |
you always have at least one action you can take. 00:29:20.860 |
So, in the deep learning era, 00:29:23.940 |
relative to, say, the logistic regression era, 00:29:26.100 |
the SVM era, it feels like we more often have 00:29:29.060 |
a way out of whatever problem we're stuck in. 00:29:33.500 |
people talk less about bias-variance trade-off. 00:29:35.960 |
You might have heard that term, bias-variance trade-off, 00:29:40.020 |
And the reason we talked a lot about that in the past was 00:29:42.440 |
because a lot of the moves available to us, like tuning regularization, traded bias off against variance. 00:29:56.020 |
But now — and this is really one of the reasons I think deep learning has been so powerful — 00:29:58.320 |
the coupling between bias and variance can be weaker. You can often 00:30:05.940 |
reduce bias without increasing variance, or reduce variance without increasing bias. 00:30:09.260 |
And really the big one is: you can just train a 00:30:13.140 |
bigger neural network, in a way that was harder to do when you were 00:30:16.380 |
training logistic regression, where the move was to come up with more and more features, right? 00:30:26.340 |
One of the- and I'm gonna add more to this diagram at the bottom in a second, okay? 00:30:47.840 |
But even if you aren't super experienced with, 00:30:53.800 |
you know, all the details of 00:30:59.060 |
how to tune a ConvNet versus a ResNet versus whatever — 00:31:03.580 |
and I definitely encourage you to keep mastering those as well — 00:31:07.980 |
you can often just do those two things, and that will drive a lot of progress: bigger model, more data is enough to do very well on a lot of problems. 00:31:18.220 |
Oh, so a bigger model puts pressure on, you know, your computer 00:31:22.420 |
systems, which is why we have a high-performance computing team. 00:31:26.220 |
Um, and more data has led to another interesting trend, which is that 00:31:41.860 |
we try to come up with all sorts of clever ways to get data. 00:31:45.940 |
Um, one area that I'm seeing more and more activity in is data synthesis. 00:32:04.860 |
It used to be that there was a lot of skill in hand-engineering the features — you know, 00:32:07.980 |
like SIFT or HOG or whatever — to feed into an SVM. 00:32:11.540 |
Um, automatic data synthesis is this little area that is still small, 00:32:20.340 |
but I'm seeing quite a lot of progress in multiple problems enabled by 00:32:24.380 |
hand-engineering, uh, synthetic data in order to feed the giant maw of your neural network. 00:32:30.540 |
All right. So let me best illustrate it with a couple of examples. Take OCR. 00:32:42.940 |
This is one of our most useful APIs at Baidu, right? 00:32:49.860 |
If you go and download a random picture off the Internet, 00:32:55.700 |
choose a random word in the English dictionary, 00:32:58.260 |
and just type the English word, as you might into Microsoft Word, in a random font with, 00:33:04.340 |
like, a transparent background, on top of that random image off the Internet, 00:33:07.500 |
then you've just synthesized a training example for OCR, right? 00:33:11.020 |
Um, and so this gives you access to essentially unlimited amounts of data. 00:33:15.100 |
It turns out that the simple idea I just described won't work in its naive form. 00:33:19.300 |
You actually need to do a lot of tuning to blend the synthesized text into the background, 00:33:24.340 |
to make sure the color contrast matches your training distribution. 00:33:27.460 |
So we found in practice it can be a lot of work to fine-tune how you synthesize data. 00:33:48.340 |
One engineer worked on this for months with very little progress, and then suddenly he got the parameters right, 00:33:52.620 |
and he had huge amounts of data and was able to 00:33:55.180 |
build one of the best OCR systems in the world at that time, right? 00:33:58.780 |
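As a minimal sketch of that OCR synthesis idea with Pillow — the file names, word list, and font path are placeholders, and a real system needs much more careful blending than a single blur:

```python
# Synthesize one labeled OCR example: random word rendered onto a random image.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

words = ["cat", "nuts", "bolts", "deep"]            # stand-in for a dictionary
background = Image.open("random_photo.jpg").convert("RGB")  # placeholder image
font = ImageFont.truetype("SomeFont.ttf", size=random.randint(20, 48))

word = random.choice(words)
draw = ImageDraw.Draw(background)
draw.text((random.randint(0, 100), random.randint(0, 100)), word,
          font=font, fill=(random.randint(0, 255),) * 3)

# Crude stand-in for the careful tuning described above (blur, contrast, ...).
sample = background.filter(ImageFilter.GaussianBlur(radius=0.5))
sample.save("synthetic_ocr_example.png")   # the image plus `word` is one example
```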
Um, other example: speech recognition, right? 00:34:07.140 |
One of the best tricks we've found for building an effective speech system is: you take clean audio, you know, 00:34:11.740 |
this is like clean, relatively noiseless audio, 00:34:14.900 |
and take random background sounds and just synthesize what 00:34:18.820 |
that person's voice would sound like in the presence of that background noise, right? 00:34:28.380 |
So it's much easier to go and record a lot of clean audio of someone speaking in a quiet environment; 00:34:32.500 |
um, the mathematical operation is actually addition — 00:34:36.300 |
you basically add the two waveforms together — 00:34:38.660 |
and then you get an audio clip that sounds like that person talking in the car, 00:34:41.980 |
and you feed this to your learning algorithm. 00:34:46.300 |
This works very well in terms of amplifying the training set for speech recognition, 00:34:49.060 |
and it can have a huge effect — we found a huge effect — on, um, performance. 00:34:55.820 |
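The synthesis step really is just adding waveforms; here's a minimal sketch (file names are placeholders, and it assumes the pysoundfile package):

```python
# Add background noise to clean speech to synthesize "talking in a car" audio.
import numpy as np
import soundfile as sf  # assumes the pysoundfile package is installed

speech, rate = sf.read("clean_speech.wav")   # quiet-room recording
noise, _ = sf.read("car_background.wav")     # separately recorded road noise
noise = np.resize(noise, speech.shape)       # crude: loop/trim to match length

noise_gain = 0.3                             # scale the noise to a chosen level
noisy = speech + noise_gain * noise          # the operation is literally addition
sf.write("synthesized_in_car.wav", noisy, rate)
```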
You know, here's one other example, actually done by, uh, 00:35:00.340 |
a group using end-to-end deep learning to do grammar correction. 00:35:03.260 |
So: input an ungrammatical English sentence, you know, 00:35:06.540 |
maybe written by a non-native speaker, right? 00:35:12.540 |
Input an ungrammatical sentence and output the grammatically corrected sentence. It turns out you can synthesize 00:35:18.740 |
huge amounts of this type of data automatically. 00:35:20.820 |
And so that's another example where data synthesis, um, works very well. 00:35:36.580 |
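One plausible way to synthesize such pairs — not necessarily how that group did it — is to corrupt grammatical text with rule-based errors; the toy rules here are my own illustration:

```python
# Synthesize (ungrammatical input, grammatical target) pairs by corruption.
import random

def corrupt(sentence):
    words = sentence.split()
    rule = random.choice(["drop_article", "swap_agreement"])
    if rule == "drop_article":       # "the results are" -> "results are"
        words = [w for w in words if w.lower() not in ("a", "an", "the")]
    else:                            # "results are" -> "results is"
        words = ["is" if w == "are" else "are" if w == "is" else w
                 for w in words]
    return " ".join(words)

clean = "The results are better than the baseline."
pair = (corrupt(clean), clean)   # one training example: input -> target
print(pair)
```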
One of the most common applications of RL, deep RL, these days is video games. 00:35:39.620 |
And I think, if you think supervised learning has 00:35:41.940 |
an insatiable hunger for data, wait till you work with RL algorithms, right? 00:35:45.620 |
I think the hunger for data is even greater. 00:35:49.140 |
And the advantage of a video game is you can synthesize almost infinite amounts of data to train on. 00:36:05.860 |
Um, you know, but let's say you wanna recognize cars, right? 00:36:15.420 |
So there's a bunch of cars in Grand Theft Auto. 00:36:17.100 |
Why don't we just take pictures of cars from Grand Theft Auto and use them as training data? 00:36:25.100 |
Um, it turns out that's difficult to do, because to the human perceptual system the game looks great — 00:36:32.180 |
it looks great to you because you can't tell if 00:36:34.420 |
there are 20 cars in the game or 1,000 cars in the game, right? 00:36:37.540 |
And so there are situations where the synthetic dataset looks great to you, 00:36:41.740 |
because 20 cars in a video game is plenty, it turns out. 00:36:44.580 |
Uh, you don't need 100 different cars for the human to think it looks realistic. 00:36:47.860 |
But from the perspective of the learning algorithm, 20 cars is a very impoverished dataset, 00:36:53.860 |
and so this is one of the pitfalls to watch out for in data synthesis. 00:36:59.620 |
one, one practice I would strongly recommend is to have a unified data warehouse, right? 00:37:06.660 |
Um, so what I mean is that if your teams, if your, you know, 00:37:12.540 |
engineer teams or research teams are going around trying to 00:37:14.900 |
accumulate the data from lots of different organizations in your company, 00:37:18.260 |
that's just going to be a pain, it's going to be slow. 00:37:30.460 |
Uh, we can have a discussion about user access rights, 00:37:37.900 |
but we mandate that this data needs to come into one logical 00:37:42.900 |
data warehouse. It may be physically distributed across lots of data centers, 00:37:49.100 |
but what we should not discuss is whether or not to bring 00:37:51.700 |
together data into as unified a data warehouse as possible. 00:37:55.340 |
And so this is another practice that I found, um, 00:37:58.340 |
makes access to data just much smoother and allows, 00:38:01.580 |
you know, teams to drive performance. 00:38:21.580 |
It turns out that this idea of a 70/30 split, right — 00:38:25.060 |
train/test or whatever — this was common in, um, 00:38:28.860 |
machine learning kind of in the past, when, you know, frankly, 00:38:32.260 |
most of us in academia were working on relatively small datasets, right? 00:38:36.900 |
You know, there used to be this thing called the UC Irvine 00:38:41.060 |
machine learning repository — an amazing resource at the time, though small by today's standards. In 00:38:54.620 |
machine learning today it is much more common for 00:38:58.220 |
your train and your test data to come from different distributions, right? 00:39:02.220 |
And, and this creates new problems and new ways of thinking about bias and variance. 00:39:10.260 |
and this is a real example from Baidu, right? 00:39:12.300 |
We built a very effective speech recognition system. 00:39:17.820 |
we wanted to launch a new product that uses speech recognition. 00:39:20.900 |
Um, we wanted a speech-enabled rear view mirror, right? 00:39:24.060 |
So, you know, if you have a car that doesn't have a built-in GPS unit, right — 00:39:28.020 |
uh, and this is a real product in China — 00:39:30.580 |
we want to let you take out your rear view mirror and put in a new, you know, 00:39:35.060 |
AI-powered, speech-enabled rear view mirror, because it's 00:39:38.060 |
an easier, uh, like an aftermarket installation. 00:39:41.380 |
So you can speak to your rear view mirror and tell it, say, where to navigate to. 00:39:48.140 |
Um, so how do you build a speech recognition system 00:39:51.660 |
for this in-car speech-enabled rear view mirror? 00:39:58.380 |
We have, you know, let's call it 50,000 hours of 00:40:03.700 |
speech recognition data, right? 00:40:11.300 |
It's a lot of data, collected from all sorts of places — 00:40:14.260 |
but not your in-car rear view mirror scenario, right? 00:40:18.060 |
And then our product managers can go around and, 00:40:24.260 |
let's say they collect 10 hours more of data from 00:40:27.620 |
exactly the rear view mirror scenario, right? 00:40:36.780 |
So that 10 hours is from exactly the distribution that you want to test on. 00:40:39.460 |
So the question is, what do you do now, right? 00:40:42.180 |
Do you throw this 50,000 hours of data away because it's 00:40:53.540 |
from the wrong distribution? It used to be more common to build one speech model just for the rear view mirror. 00:41:02.500 |
These days it's becoming more and more common to just pile 00:41:04.140 |
all the data into one model and let the model sort it out. 00:41:10.100 |
And if you get the features right, 00:41:12.780 |
you can usually pile all the data into one model and have it work. 00:41:20.420 |
But the question is, given this dataset, you know, 00:41:23.540 |
how do you split this into train, dev, and test, right? 00:41:43.040 |
And the best practice I recommend is: make sure your development set and test sets are from the same distribution. 00:41:54.580 |
Right. I've been finding that this is one of the tips that 00:41:57.540 |
really boosts the effectiveness of a machine learning team. 00:42:00.940 |
Um, so in particular, I would make all 50,000 hours the training set, 00:42:06.620 |
and then — well, let me expand this a little bit, right — 00:42:08.500 |
split the 10 hours of rear-view-mirror data into a much smaller dev set and test set: maybe five hours of dev, five hours of test. 00:42:16.100 |
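As a sketch of that split in code — the loader and the hour counts just mirror the example above:

```python
# Dev and test both come from the target (rear-view-mirror) distribution;
# the big general corpus all goes into training. `load_clips` is a stub.
import random

load_clips = lambda source: [f"{source}/clip_{i}.wav" for i in range(1000)]

general = load_clips("general_50k_hours")   # 50,000 hours, mixed sources
mirror = load_clips("mirror_10_hours")      # 10 hours you actually care about

random.shuffle(mirror)
dev = mirror[: len(mirror) // 2]            # ~5 hours: what the team tunes on
test = mirror[len(mirror) // 2 :]           # ~5 hours: the final check
train = general                             # optionally plus some mirror data
```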
Your team will be working to tune things on the dev set, right? 00:42:20.420 |
And the last thing you want is if they spend three months tuning to the dev set, 00:42:31.900 |
only to find the test set is totally different — having different dev and test set distributions is a bit like if I tell you to run toward a target over here, and then three months later I say the target is actually to the north. 00:42:44.900 |
And you go, "What? Why didn't you tell me to go north in the first place?" 00:42:47.860 |
Right. And so I think having dev and test sets be from the same distribution is one of 00:42:53.500 |
the ideas that I found really optimizes the team's efficiency because it, you know, 00:42:57.860 |
the development set, which is what your team is going to be tuning algorithms to, 00:43:02.060 |
that is really the problem specification, right? 00:43:04.740 |
And the problem specification tells them where to aim. So 00:43:12.020 |
having dev and test from the same distribution — 00:43:19.060 |
this really improves the team's efficiency. 00:43:23.860 |
Um, and another thing is once you specify the dev set, 00:43:27.620 |
that's like your problem specification, right? 00:43:33.100 |
Your team might go and collect more training data or change the training set, and that's fine. 00:43:38.220 |
But, you know, you shouldn't change the test set if 00:43:40.980 |
the test set is your problem specification, right? 00:43:46.420 |
What I actually recommend is splitting your training set as follows: 00:44:02.100 |
carve out a piece of the training data and call it the train-dev set — basically a development set that's from 00:44:06.180 |
the training distribution — and then you have your dev set and your test set, right, 00:44:09.940 |
from the distribution you actually care about. 00:44:15.260 |
The training set is whatever data you piled in, and maybe we aren't even entirely sure what data this is. 00:44:26.020 |
Um, and then here's the generalization of the bias-variance concept. 00:44:41.220 |
The fact that training and test sets don't match is one of 00:44:46.300 |
the problems that, um, academia doesn't study much. 00:44:52.260 |
But it turns out that when you train and test on different distributions, 00:44:57.740 |
it's a little bit of luck whether you generalize well to a totally different test set. 00:45:01.580 |
So that's made it hard to study systematically, and 00:45:05.780 |
academia has not studied this particular problem as much as I feel it is important 00:45:11.180 |
to those of us building production systems. 00:45:13.620 |
There is some work, but 00:45:16.020 |
no very widely deployed solutions yet, would be my sense. 00:45:21.900 |
So if you now generalize what I was describing just now to the following, 00:45:25.060 |
which is, um: measure human-level performance, your training set performance, 00:45:40.340 |
your train-dev set performance, your dev set performance, and your test set performance, right — 00:45:49.060 |
um, and I'm gonna use very obvious examples for illustration. 00:46:02.420 |
Say there's a huge gap between 00:46:05.780 |
human-level performance and training set performance: then it's quite clear you have a huge bias, right? 00:46:13.780 |
So you'd use the bias-fixing types of, um, solutions. 00:46:18.140 |
Um, and then one of the most useful things 00:46:27.100 |
is to look at the aggregate error of your system 00:46:37.180 |
and figure out how much of it comes from where: 00:46:42.020 |
the gap from human level to training error is bias — 00:46:47.620 |
there I would work on the bias reduction techniques — 00:46:55.660 |
the gap from training to train-dev error is variance, and the gap from train-dev to dev 00:47:18.620 |
error is due to your train/test distribution mismatch. So here's an example where you have high train/test mismatch: 00:47:44.740 |
if training and train-dev errors are low but dev error is much higher, then I would say you have a huge train/test set mismatch problem. 00:47:48.460 |
Okay. Um, and so at this basic level of analysis, 00:47:52.140 |
this extends, you know, the formula, the flow chart, for machine learning: you ask one more question — is your dev error much higher than your train-dev error? 00:48:17.740 |
If yes, then you have a train/test mismatch problem. 00:48:23.740 |
And there the solution would be to try to get more data similar to your test distribution, 00:48:34.060 |
or maybe data synthesis or data augmentation — 00:48:38.700 |
you know, try to tweak your training set to make it look more like your test set. 00:48:43.260 |
Um, and then there's always this kind of, uh, magical option, 00:48:47.500 |
which is, you know, a new architecture, right? 00:48:54.380 |
Um, and then finally, just to finish this up: you know, 00:49:02.820 |
hopefully at the end your test set error will be good too. 00:49:05.740 |
And if you're doing well on your dev set but not your test set — 00:49:34.300 |
if your dev set error is not high but your test set error is high — it probably means you overfit your dev set, and you should get a bigger dev set. 00:49:53.940 |
It sounds so simple, but it's actually much more difficult to apply in 00:49:58.500 |
practice than it sounds when I talk about it like this, right? 00:50:04.540 |
Just calculate these numbers, and this can help drive your analysis in terms of deciding what to do. 00:50:14.020 |
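As a sketch of the whole decomposition in code — the numbers are illustrative percentages, and "train_dev" is the held-out slice of the training distribution described above:

```python
# Attribute the aggregate error to bias, variance, mismatch, and dev overfitting.
def decompose(human, train, train_dev, dev, test):
    return {
        "bias": train - human,                   # fix: bigger model, train longer
        "variance": train_dev - train,           # fix: more data, regularization
        "train/test mismatch": dev - train_dev,  # fix: synthesis, target data
        "dev overfitting": test - dev,           # fix: get a bigger dev set
    }

gaps = decompose(human=1.0, train=2.0, train_dev=2.5, dev=7.0, test=7.5)
print(gaps)   # here the dominant gap (2.5 -> 7.0) is train/test mismatch
```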
And, and I find that it takes surprisingly long to really grok, 00:50:19.100 |
to really understand bias and variance deeply. 00:50:21.700 |
But I find that people that understand bias and variance deeply are often able to drive rapid progress. 00:50:30.380 |
And I know it's much sexier to show you some cool new network architecture, 00:50:36.980 |
but this really helps our teams make rapid progress on things. 00:50:46.940 |
there's one thing I kind of snuck in here without making it explicit, 00:50:54.580 |
which is that we were benchmarking against human-level performance, right? 00:51:01.020 |
And that's another thing that has been different. 00:51:05.300 |
I'm looking across a lot of projects I've seen in many areas and 00:51:07.860 |
trying to pull out the common trends but I find that comparing to 00:51:11.300 |
human level performance is a much more common theme now than several years ago, right? 00:51:20.220 |
Um, and, and, and really at Baidu we compare our speech systems 00:51:23.460 |
to human level performance and try to exceed it and so on. 00:51:29.420 |
So why, why is human-level performance 00:51:32.620 |
such a common theme in applied deep learning? 00:51:43.660 |
It turns out that if you plot progress against time — you know, how long you've been working on a project — 00:51:51.060 |
with a line for, you know, human-level accuracy or human-level performance on some task, 00:51:58.260 |
your teams will make rapid progress, you know, 00:52:02.060 |
up until they get to human-level performance. 00:52:05.580 |
And then often it will maybe surpass human level performance a bit, 00:52:09.420 |
and then progress often gets much harder after that, right? 00:52:12.860 |
But this is a common pattern I see in a lot of problems. 00:52:16.220 |
Um, so there are multiple reasons why this is the case. 00:52:28.620 |
Oh, cool. Yep. Labels are coming from humans. Anything else? 00:52:37.980 |
Oh, interesting. The architecture is modeled after the human brain. 00:52:42.060 |
Well, I'd say the distance from neural nets to human brains is very far. 00:52:49.820 |
Uh — the human capacity to deal with these kinds of problems is very similar? 00:52:53.420 |
I see. Yeah. Human capacity to deal with these problems is similar. 00:53:19.540 |
Cool. All right. So let me, uh — 00:53:23.180 |
all, you know, lots of great answers. 00:53:26.580 |
So there are several good reasons for this type of effect. 00:53:35.260 |
One is that there is some theoretical limit of performance, right? 00:53:42.660 |
In speech recognition, a lot of audio clips are just noisy. 00:53:48.020 |
they're in a rock concert or something and it's just 00:53:49.820 |
impossible to figure out what on earth they were saying, right? 00:53:52.580 |
Or some images, you know, are just so blurry. 00:53:54.620 |
It's just impossible to figure out what this is. 00:54:10.820 |
But really, there is some theoretical optimum where even if you had 00:54:16.540 |
the best possible model with the best possible parameters, it cannot do better than that, because 00:54:19.660 |
the input is just noisy and sometimes impossible to label. 00:54:25.220 |
Humans are pretty good at a lot of the tasks we do — not all, 00:54:29.420 |
but humans are actually pretty good at speech recognition, for example. 00:54:32.740 |
And so, you know, by the time you surpass human level accuracy, 00:54:36.340 |
there might not be a lot of room, right, to go, to go further up. 00:54:42.540 |
Um, other reasons — I think a couple of people said them, right? 00:54:47.500 |
So long as you're still worse than humans, uh, 00:54:51.220 |
you have better levers to make progress — for example, getting labels from humans, or doing error analysis. 00:55:32.660 |
And error analysis just means: look at the dev set examples your system got wrong, 00:55:37.700 |
and see, you know, if the humans have any insight into why a human knew this is a cat, 00:55:46.860 |
but your system just, uh, got it wrong. 00:55:49.820 |
Um, and then I think another reason is that it's easier to estimate bias and variance. 00:56:12.220 |
Let's say that, uh, you're working on some image recognition task, and the question is: should you work on 00:56:41.260 |
bias reduction techniques or should you work on variance reduction techniques? 00:56:53.500 |
If your training error is already close to human-level error, then you're pretty close on the training set to 00:56:56.620 |
human, and you would think you have more of a variance problem. 00:57:03.380 |
Whereas if humans get, say, 1% error and your training error is well above that, then you know that even on the training set 00:57:09.260 |
there is room to improve — and so, well, you should build a bigger network or something, right? 00:57:12.500 |
So this piece of information about where humans are serves 00:57:17.460 |
as an approximation for the Bayes error rate, for the optimal error rate. 00:57:20.500 |
This piece of information really tells you where you should focus your effort, 00:57:23.660 |
and therefore increases the efficiency of your team. 00:57:26.300 |
But once you surpass human-level performance, I mean, 00:57:34.300 |
then it's just slightly tougher. 00:57:36.940 |
So that's just another thing that becomes harder to do: 00:57:40.100 |
that you no longer have a proxy for estimating 00:57:42.980 |
the Bayes error rate to decide how to improve performance, right? 00:57:47.260 |
Um, so, you know, there are definitely lots of problems where we 00:57:49.980 |
surpass human level performance and keep getting better and better. 00:57:55.300 |
But I find that my life building deep learning applications is 00:57:59.180 |
often easier until we surpass human-level performance. 00:58:03.220 |
And after we surpass human-level performance, um, well, one thing we sometimes do is look for 00:58:09.660 |
subsets of data where we still do worse than humans. 00:58:14.100 |
For example, right now we surpass human-level performance for speech accuracy, 00:58:16.980 |
uh, for short audio clips taken out of context. But suppose 00:58:20.580 |
we're still way worse than humans on one particular type of accented speech. 00:58:24.780 |
Then even if we are much better than humans in the aggregate, 00:58:27.420 |
if we find we're much worse than humans on that subset of data, we can use that subset to keep driving progress. 00:58:31.860 |
But remember, this is kind of an advanced topic maybe, 00:58:33.900 |
where you segment the data and analyze subsets. 00:58:40.020 |
>> What if there's a task where the typical human error rate is more like 30%, not 1%? 00:58:47.700 |
I see. Actually, you know, that's a wonderful question. 00:58:50.020 |
I want to ask a related quiz question to everyone in the audience. 00:58:52.940 |
I'm gonna come back to- to- to what Alex just said. 00:58:55.300 |
All right. So, um, given everything we just said, 00:59:01.900 |
all right, um, I'm gonna pose a question, uh, 00:59:04.940 |
write down four choices, and then ask you to raise your hand to vote — a quiz on 00:59:14.260 |
how the concept of human-level accuracy is useful for driving machine learning progress. 00:59:19.780 |
Right. So, um, how do you define human level performance? 00:59:26.940 |
I'm spending a lot of time working on AI in healthcare. 00:59:29.060 |
So a lot of medical examples in my head right now. 00:59:31.060 |
But let's say that you want to do medical imaging for medical diagnosis. 00:59:42.420 |
So my question to you is how do you define human level performance? 00:59:53.700 |
Right. Let's say that the error rate of a typical human at reading a certain type of medical image is 3%. 01:00:16.660 |
And let's say an expert doctor makes 0.7% error, and a team of expert doctors makes 0.5% error. 01:00:28.740 |
And what I mean is if I find a team of expert doctors and have a team look at 01:00:35.580 |
every image and debate and discuss and have them come to, 01:00:38.580 |
you know, the team's best guess of what's happening with this patient. 01:00:48.580 |
Which of these is the most useful definition of 01:00:52.420 |
human level error if you want to use this to drive the performance of your algorithms? 01:00:56.380 |
Okay. So who thinks choice A? Raise your hand. 01:01:07.580 |
Uh, don't worry about ease of obtaining this data. 01:01:09.820 |
Yeah. Right. So which is the most useful definition? 01:01:31.020 |
I think that, for the purpose of driving machine learning progress — 01:01:34.620 |
and ignoring the cost of collecting data; that was a great question — 01:01:38.420 |
um, I would find the last definition, the team of expert doctors, the most useful, because 01:01:45.220 |
a lot of what we're trying to use human-level performance as a proxy for is the Bayes rate, the optimal error rate — 01:01:52.260 |
really to measure the baseline level of noise in your data. If a team of doctors achieves 0.5% error, 01:01:59.860 |
then you know that the mathematically optimal error rate 01:02:02.460 |
has got to be 0.5% or maybe even a little bit better. 01:02:05.580 |
Um, and so this is the definition I'd use for the purpose of driving all these decisions, 01:02:10.380 |
such as, um, estimating bias and variance, right? 01:02:18.820 |
Um, uh, because you know that the Bayes error rate is at most 0.5%. 01:02:29.660 |
For building products, I would fully expect teams to use this definition. 01:02:34.060 |
And, by the way, publishing papers is different than, um — 01:02:37.100 |
the goal of publishing papers is different than the goal of actually, 01:02:39.300 |
you know, building the best possible product, right? 01:02:42.820 |
In papers, people like to say, oh, we're better than the human level. 01:02:45.020 |
So for that, I guess using whichever definition is easiest to beat would be what many people would do. 01:02:48.420 |
Um, uh, and if you're actually trying to collect labels, you know, 01:02:58.380 |
you might have one doctor label the data, and if they're still unsure, then find, you know, a team to look at it. 01:03:18.020 |
Oh — is it possible that a team of expert doctors does worse than a single doctor? 01:03:21.500 |
I don't know. I'd have to ask the doctors in the audience. 01:03:30.740 |
Just — I have just two more pages and I'll wrap up. 01:03:34.380 |
So one of the reasons I think, in the era of deep learning, 01:03:36.780 |
we refer to human-level performance much more frequently is because 01:03:42.420 |
we are approaching human-level performance, right? 01:03:47.420 |
When — I guess maybe to continue this example, right — 01:03:52.100 |
when your training set error in computer vision was, you know, still really high, 01:04:00.060 |
then it didn't really matter if human-level performance was 1% or 2% or 3%. 01:04:04.420 |
It didn't affect your decision that much, because you're 01:04:06.700 |
just so clearly far, so far, from Bayes, right? 01:04:09.660 |
But now, as really more and more deep learning systems are 01:04:13.060 |
approaching human-level performance on all these tasks, knowing where human level is 01:04:17.660 |
actually gives you very useful information to drive these decisions. 01:04:21.220 |
And so honestly, for a lot of the teams I work with, I say: 01:04:26.060 |
please go and figure out what human-level performance is, 01:04:28.380 |
and then spend some time to have humans label and get that number, because that 01:04:32.180 |
number is useful for driving some of these decisions. 01:04:36.780 |
So, um, just two last things and then we'll finish. 01:04:41.700 |
Um, you know, one question I get asked a lot is, um, what can AI do? 01:04:50.460 |
Um, and I guess, maybe particularly at a company, you often find that 01:05:08.780 |
we've developed pretty good workflows for designing 01:05:11.260 |
products in the desktop era and in the mobile era, right? 01:05:19.060 |
Excuse me — the product manager draws a wireframe, 01:05:21.300 |
the designer does the visual design or something, or, 01:05:23.820 |
you know, they work together, and then the programmer implements it. 01:05:25.820 |
So we have well-defined workflows for how to design, 01:05:29.340 |
you know, typical apps like the Facebook app or the Snapchat app or whatever. 01:05:34.780 |
We have workflows established in companies to design stuff like that. 01:05:40.980 |
I feel like we don't have good processes yet for designing AI products. 01:05:45.740 |
So for example, how should a product manager specify — 01:05:52.940 |
how does a product manager specify what level of accuracy is needed for my cat detector? 01:06:02.540 |
Um, I find us inventing new processes in order to design AI products, right? 01:06:13.260 |
So one question I get asked, sometimes by business people, is: what can AI do? 01:06:15.060 |
Because when a product manager is trying to design a new thing, 01:06:18.240 |
you know, it's nice if you can help them know what they can design. 01:06:25.860 |
so, so I'm only giving some rules of thumb that are far from perfect, 01:06:29.500 |
but that I found useful for thinking about what AI can do. 01:06:32.980 |
Oh, oh, before I tell you the rules I use, um, 01:06:35.940 |
here's one of the rules of thumb that a product manager I know was using, 01:06:39.520 |
which is he says, assume that AI can do absolutely anything, right? 01:06:44.420 |
And this actually wasn't terrible. 01:06:47.660 |
It actually led to some good results. But I wanna give you maybe slightly better 01:06:54.380 |
ways of communicating about modern deep learning, 01:06:56.940 |
um, in these sorts of organizations. 01:07:04.860 |
So here's the first rule of thumb I use: anything a typical person can do in less than one second, right? 01:07:18.340 |
Which is that if it's a task that a normal person can do with less than one second of thinking, 01:07:22.680 |
there's a very good chance we could automate it with deep learning. 01:07:27.220 |
So for example: given a picture, tell me if the face in this picture is smiling or frowning. 01:07:30.780 |
You don't need to think for more than a second. 01:07:32.960 |
So, yes, we can build deep learning systems to do that really well, right? 01:07:38.520 |
Or, like, uh: listen to this audio clip — what did they say? 01:07:41.060 |
You don't need to think for that long; it's less than a second. 01:07:43.360 |
So this covers really a lot of the perception work. 01:07:52.700 |
Uh, it's a slightly strange rule — I think because it leads to 01:07:57.340 |
a bunch of product managers looking around for tasks that humans can do in less than one second — 01:08:03.980 |
but it's still a useful rule of thumb. Um, there's a question? 01:08:22.080 |
I feel like a lot of the value of deep learning, 01:08:24.000 |
a lot of the concrete short-term applications, comes from 01:08:29.240 |
trying to automate things that people can do — 01:08:31.440 |
uh, really, especially things people can do in a very short time. 01:08:34.400 |
And this feeds into all the advantages you get, you know, 01:08:37.980 |
when you're trying to automate something that a human can already do. 01:08:48.580 |
Oh, I see — oh, that's an interesting observation: 01:08:49.980 |
if a human can label it in less than a second, 01:08:51.540 |
you can get a lot of data. Yeah, that's an interesting observation. 01:08:58.060 |
The second rule of thumb — the other huge bucket of deep learning applications I've seen create tons of value — 01:09:18.460 |
is predicting the outcome of something that happens over and over. 01:09:21.560 |
Maybe not super inspiring: we show a user an ad, right, 01:09:25.700 |
and the user clicks on it or doesn't click on it; with 01:09:28.020 |
tons of data, you can predict if the user will click on the next ad — 01:09:30.580 |
probably the most lucrative application of, uh, deep learning today. 01:09:38.660 |
Or: if you order food from this restaurant to go to this destination at this time of day, 01:09:42.060 |
how long does it take? We've seen that a ton of times, so we can predict 01:09:46.380 |
how long it will take to send this food to you. 01:09:50.500 |
I mean, you know, deep learning does so much stuff, 01:09:52.420 |
so I've struggled a bit to come up with simple rules to explain to non-experts. 01:09:59.280 |
I found these two rules useful even though I know they are 01:10:02.420 |
clearly highly flawed and there are many, many counterexamples, right? 01:10:08.260 |
Um, so it's an exciting time for deep learning, because I think it's changing what we can build. 01:10:14.340 |
It's also causing us to rethink how we organize our companies, 01:10:18.900 |
the workflow and the process for our products. 01:10:22.620 |
I think there's a lot of excitement going on. 01:10:24.900 |
Um, the last thing I want to do is, um, you know, 01:10:28.220 |
I found that the number one question I get asked is, 01:10:32.920 |
um, uh, how do you build a career in machine learning, right? 01:10:38.920 |
And I think, um, you know, when I, when I did a Reddit Ask Me Anything, 01:10:43.920 |
a Reddit AMA, that was one of the questions that was asked. 01:10:46.680 |
Even today, a few people came up to me and said, you know, 01:10:50.640 |
I've taken, you know, the machine learning MOOC on Coursera or something else — 01:10:52.960 |
Um, what advice do you have for building a career in machine learning? 01:10:56.560 |
I have to admit, I, I don't have an amazing answer to that, but 01:10:59.460 |
since I get asked that so often and because I really want to think what would be 01:11:04.020 |
the most useful content to you, I, I, I thought I'll at least attempt an answer, 01:11:07.660 |
even though it is maybe not a great one, right? 01:11:10.540 |
So this is the last thing I had. Uh, at the start, 01:11:17.340 |
I was asking myself this same question, 01:11:22.340 |
which is, you know, after you've taken a machine learning course, um, 01:11:26.260 |
what's the next step for, um, developing your machine learning career? 01:11:33.720 |
Well, the best thing would be if you attend deep learning school. 01:11:36.960 |
[LAUGH] So, Samy, Peter and I got together to do this, I hope. 01:11:41.720 |
[LAUGH] Um, this was really part of the motivation. 01:11:44.960 |
Um, and then beyond that, right, what 01:11:46.880 |
are the things that really help? 01:11:49.160 |
So actually, I think all of our organizations have had quite a lot 01:11:53.640 |
of people who want to move from non-machine learning into machine learning. 01:11:57.460 |
And when I look at the career paths, um, you know, 01:12:00.020 |
one common thing is, after taking these courses, people practice on real problems. 01:12:05.660 |
I've seen — I have a lot of respect for Kaggle. 01:12:07.220 |
A lot of people actually participate in Kaggle and learn from the blogs there, and 01:12:10.460 |
then become better and better at it. 01:12:12.500 |
But I wanna share with you one other thing that I haven't really shared. 01:12:15.100 |
Oh, by the way, almost everything I talk about today 01:12:17.180 |
is new content that I've never presented before, right? 01:12:28.680 |
So I want to share with you really 01:12:31.640 |
one thing, which is the PhD student process, right? 01:12:39.920 |
When I was teaching full time at Stanford, a lot of people would join Stanford and 01:12:43.040 |
ask me, you know, how do I become a machine learning researcher? 01:12:46.160 |
How do I have my own ideas on how to push the bleeding edge of machine learning? 01:12:50.480 |
And whether you're working in robotics or machine learning or something else, 01:12:56.460 |
there's one PhD student process that I find has been incredibly reliable. 01:13:00.660 |
And I'm gonna say it, and you may or may not trust it, but 01:13:06.500 |
I've seen this work so reliably so many times that I hope you take my word for 01:13:10.580 |
it, that this process reliably turns non-machine learning researchers into, 01:13:15.020 |
you know, very good machine learning researchers. And there's 01:13:18.700 |
no magic, really: read a lot of papers and work on replicating results. 01:13:27.160 |
And I think that the human brain is a remarkable device, you know? 01:13:32.080 |
People often ask me, how do you have new ideas? 01:13:34.240 |
And I find that if you read enough papers and replicate enough results, 01:13:38.360 |
you will have new ideas on how to push the bleeding edge forward, right? 01:13:42.000 |
I don't really know how the human brain works, but 01:13:45.680 |
I've seen this be an incredibly reliable process. 01:13:48.360 |
If you read enough papers, you know, between 20 and 50 papers later, and 01:13:52.700 |
it's not one or two, it's more like 20 or maybe 50, you will start to have your own 01:13:56.060 |
ideas. So you see Sammy's nodding his head. 01:13:58.820 |
This is an incredibly reliable process, right? 01:14:05.540 |
Sometimes people ask me what working in AI is like. 01:14:09.060 |
And I think some people have this picture that when we work on AI, you know, 01:14:15.980 |
some people have this picture of us hanging out in these airy, 01:14:20.520 |
well-lit rooms with natural plants in the background, 01:14:25.520 |
all standing in front of a whiteboard discussing the future of humanity. 01:14:36.320 |
Frankly, almost all we do is dirty work, right? 01:14:42.840 |
And one place that I've seen people get tripped up is when they think working on 01:14:46.700 |
AI is that future-of-humanity stuff, and they shy away from the dirty work. 01:14:52.060 |
And dirty work means anything from going on the Internet and downloading data and 01:14:57.100 |
cleaning data, or downloading a piece of code and tuning parameters to see what 01:15:00.980 |
happens, or debugging your stack trace to figure out why this silly thing overflowed. 01:15:06.020 |
Or optimizing the database, or hacking a GPU kernel to make it faster, or 01:15:10.340 |
reading a paper and struggling to replicate the result. 01:15:13.040 |
At the end, a lot of what we do comes down to dirty work. 01:15:16.240 |
And yes, there are moments of inspiration, but 01:15:19.080 |
I've seen people really stall if they refuse to get into the dirty work. 01:15:23.720 |
So my advice to you is to do both. Actually, another place I've seen people stall is 01:15:29.160 |
if they only do dirty work; then you can become great at data cleaning, but 01:15:33.880 |
you never get better and better at having your own moments of inspiration. 01:15:38.360 |
So one of the most reliable formulas I've seen is really if you do both of these. 01:15:44.140 |
If your team needs you to do some dirty work, just go and do it. 01:15:50.660 |
And I think the combination of these two is the most reliable formula I've seen for becoming a very good machine learning researcher. 01:15:56.380 |
So I want to close with just one more story about this. 01:16:07.900 |
And I guess some of you may have heard me talk about the Saturday story, right? 01:16:13.720 |
But for those of you that want to advance your career in machine learning, 01:16:20.240 |
next weekend you can either stay at home and watch TV, or you could do this, right? 01:16:28.200 |
And it turns out this is much harder, and there are no short-term rewards for it. 01:16:32.080 |
I think this weekend you guys are all doing great. But 01:16:39.720 |
if you spend next weekend studying, reading papers, replicating results, 01:16:44.400 |
and then you go to work the following Monday, your boss doesn't know what you did. 01:16:48.760 |
No one's gonna pat you on the back and say good job. 01:16:52.720 |
And realistically, after working really, really hard next weekend, 01:16:55.960 |
you're not actually that much better; you're barely any better at your job. 01:16:59.720 |
So there's pretty much no reward for working really, really hard that one weekend. 01:17:04.760 |
But I think the secret to advancing your career is this: 01:17:08.920 |
do this not just on one weekend, but weekend after weekend, for a long time. 01:17:15.440 |
In fact, everyone I've worked with at Stanford that was close to this and 01:17:20.560 |
became great at it, everyone, actually including me, was a grad student. 01:17:25.000 |
We all spent late nights hunched over a neural net tuning hyperparameters. 01:17:31.480 |
And it was that process of doing this not just one weekend, but weekend after weekend, 01:17:35.720 |
that allowed all of us, really allowed our brains, our own 01:17:40.200 |
neural networks, to learn the patterns that taught us how to do this. 01:17:44.640 |
So I hope that even after this weekend, you keep on spending the time to keep 01:17:49.120 |
learning because I promise that if you do this for long enough, 01:17:51.680 |
you will become really, really good at deep learning. 01:17:55.160 |
So just to wrap up, I'm super excited about AI, 01:17:58.520 |
and I keep making this analogy that AI is the new electricity, right? 01:18:02.280 |
And what I mean is that just as 100 years ago, 01:18:05.840 |
electricity transformed industry after industry, right? 01:18:08.880 |
Electricity transformed agriculture, manufacturing, transportation, and so on. 01:18:14.240 |
I feel like those of you that are familiar with AI are now in an amazing position 01:18:19.480 |
to go out and transform not just one industry, but potentially a ton of industries. 01:18:24.360 |
So I guess at Baidu, I have a fun job trying to transform not just one industry, but many. 01:18:32.560 |
But I see that it's very rare in human history 01:18:40.720 |
where someone like you can gain the skills and 01:18:44.000 |
do the work to have such a huge impact on society. 01:18:47.720 |
I think in Silicon Valley, the phrase change the world is overused, right? 01:18:51.760 |
Every Stanford undergrad says I want to change the world. 01:18:54.240 |
But for those of you that work in AI, I think that the path from what you do to 01:18:58.200 |
actually having a big impact on a lot of people, and helping a lot of people, is real. 01:19:01.800 |
In transportation, in healthcare, in logistics, in whatever it is. 01:19:07.360 |
So I hope that all of you will keep working hard even after this weekend and keep learning. 01:19:42.960 |
Okay, so let's break for today and look forward to seeing everyone tomorrow.