
Nuts and Bolts of Applying Deep Learning (Andrew Ng)



00:00:00.000 | >> So when we're organizing this workshop, my co-organizers initially asked me,
00:00:05.120 | hey Andrew, end of the first day, go give a visionary talk.
00:00:07.680 | So until several hours ago, my talk was advertised as visionary talk.
00:00:12.640 | >> [LAUGH] >> But
00:00:13.920 | I was preparing for this presentation over the last several days.
00:00:18.640 | I've tried to think what would be the most useful information to you, and
00:00:23.320 | what are the things that you could take back to work on Monday and
00:00:26.040 | do something different at your job next Monday.
00:00:28.480 | And I thought that we need to set context right now, as Peter mentioned,
00:00:32.000 | I lead Baidu's AI team.
00:00:33.480 | So it's a team of about 1,000 people working on vision, speech, NLP,
00:00:37.520 | lots of applications and machine learning.
00:00:40.400 | And so what I thought I'd do instead is,
00:00:42.680 | instead of taking the shiniest pieces of deep learning that I know,
00:00:46.000 | I want to take the lessons that I saw at Baidu that are common to so
00:00:50.280 | many different academic areas, as well as applications,
00:00:54.000 | autonomous cars, augmented reality, advertising, web search, medical diagnosis.
00:00:59.360 | We'll take what I saw, the common lessons, the simple,
00:01:02.080 | powerful ideas that I've seen drive a lot of machine learning progress at Baidu.
00:01:06.720 | And I thought I will share those ideas with you,
00:01:09.200 | because the patterns I see across a lot of projects, I thought might be the patterns
00:01:13.480 | that would be most useful to you as well,
00:01:15.440 | whatever you are working on in the next several weeks or months.
00:01:19.480 | So one common theme that will appear in this presentation today is that
00:01:25.360 | the workflow of organizing machine learning projects feels like
00:01:28.920 | parts of it are changing in the era of deep learning.
00:01:31.840 | So for example, one of the ideas I'll talk about is bias-variance.
00:01:34.720 | It's a super old idea, right?
00:01:36.520 | And then many of you, maybe all of you have heard of bias-variance.
00:01:40.640 | But in the era of deep learning,
00:01:42.760 | I feel like there have been some changes to the way we think about bias-variance.
00:01:46.280 | So I wanna talk about some of these ideas, which maybe aren't even deep learning per
00:01:49.920 | se, but have been slowly shifting as we apply deep learning to more and
00:01:54.480 | more of our applications, okay?
00:01:57.360 | And instead of holding all your questions until the end,
00:02:01.360 | if you have a question in the middle, feel free to raise your hand as well.
00:02:03.760 | I'm very happy to take questions in the middle,
00:02:05.720 | since this is a more maybe informal whiteboard talk, right?
00:02:08.760 | And also, say hi to our home viewers, hi.
00:02:11.000 | >> [LAUGH]
00:02:13.800 | >> So one question that I still get asked sometimes is, and
00:02:17.800 | then kind of Andrej alluded to this earlier,
00:02:20.840 | a lot of the basic ideas of deep learning have been around for decades.
00:02:24.280 | So why are they taking off just now, right?
00:02:27.200 | Why is it that deep learning, these neural networks that have all been known for
00:02:30.680 | maybe decades, why are they working so well now?
00:02:33.800 | So I think that the one biggest trend in deep learning is scale,
00:02:39.200 | that scale drives deep learning progress.
00:02:41.440 | And I think Andrej mentioned scale of data and scale of computation, and
00:02:45.960 | I'll just draw a picture that illustrates that concept maybe a little bit more.
00:02:50.840 | So if I plot a figure, where on the horizontal axis I plot
00:02:56.520 | the amount of data we have for a problem, and on the vertical axis we plot performance.
00:03:03.000 | Right, so x-axis is the amount of spam data you've collected,
00:03:06.160 | y-axis is how accurately can you classify spam.
00:03:10.080 | Then if you apply traditional learning algorithms,
00:03:13.920 | right, what we found was that the performance
00:03:18.640 | often looks like it starts to plateau after a while.
00:03:23.600 | It was as if the older generations of learning algorithms,
00:03:26.680 | including logistic regression, SVMs,
00:03:30.600 | was as if they didn't know what to do with all the data that we finally had.
00:03:34.000 | And what happened kind of over the last 20 years,
00:03:36.920 | last 10 years, was with the rise of the Internet, rise of mobile, rise of IoT.
00:03:42.320 | Society has sort of marched to the right of this curve, right, for
00:03:46.720 | many problems, not all problems.
00:03:48.840 | And so with all the buzz and all the hype about deep learning,
00:03:53.280 | in my opinion, the number one reason that deep learning algorithms work so
00:03:58.120 | well is that if you train, let me call it a small neural net,
00:04:04.360 | maybe you get slightly better performance.
00:04:07.120 | If you train a medium-sized neural net,
00:04:11.880 | right, maybe you get even better performance.
00:04:18.440 | And it's only if you train a large neural net that you could train a model with
00:04:24.720 | the capacity to absorb all this data that we now have access to,
00:04:28.320 | that allows you to get the best possible performance.
00:04:30.840 | And so I feel like this is a trend that we're seeing in many verticals in
00:04:33.560 | many application areas.
00:04:35.280 | A couple of comments.
00:04:36.880 | One is that this, actually when I draw this picture, some people ask me, well,
00:04:42.400 | does this mean a small neural net always dominates a traditional learning algorithm?
00:04:46.200 | The answer is not really.
00:04:47.720 | Technically, if you look at the small data regime,
00:04:50.520 | if you look at the left end of this plot, right,
00:04:54.440 | the relative ordering of these algorithms is not that well defined.
00:04:57.480 | It depends on who's more motivated to engineer the features better, right?
00:05:00.840 | If the SVM guy is more motivated to spend more time engineering features,
00:05:05.520 | they might beat out the neural network application.
00:05:09.600 | But because when you don't have much data,
00:05:12.800 | a lot of the knowledge of the algorithm comes from hand engineering, right?
00:05:16.120 | But this trend is much more evident in a regime of big data where you just can't
00:05:20.040 | hand engineer enough features.
00:05:21.760 | And the large neural net combined with a lot of data tends to outperform.
00:05:26.040 | So a couple of comments.
00:05:29.400 | The implication of this figure is that in order to get the best performance,
00:05:33.200 | in order to hit that target, you need two things, right?
00:05:36.240 | One is you need to train a very large neural network,
00:05:39.280 | or reasonably large neural network, and you need a large amount of data.
00:05:45.040 | And so this in turn has caused pressure to train large neural nets, right?
00:05:51.520 | Build large nets as well as get huge amounts of data.
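A minimal sketch of the performance-versus-data picture described above; the synthetic dataset, the model sizes, and the data amounts are placeholder choices, not from the talk, but on many problems the larger network keeps improving where the smaller one plateaus.

```python
# Toy illustration of the curves described above: a small vs. a larger MLP
# trained on increasing amounts of (synthetic) data. Sizes and counts are
# illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=12000, n_features=40, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2000,
                                                    random_state=0)

for n in [500, 2000, 8000]:                       # amount of training data
    for name, hidden in [("small net", (8,)), ("large net", (128, 128))]:
        clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=200,
                            random_state=0).fit(X_train[:n], y_train[:n])
        print(f"{n:>5} examples | {name}: test acc = "
              f"{clf.score(X_test, y_test):.3f}")
```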
00:05:54.800 | So one of the other interesting trends I've seen is that
00:05:58.120 | increasingly I'm finding that it makes sense to build an AI team as well as
00:06:05.320 | build a computer systems team and have the two teams kind of sit next to each other.
00:06:09.320 | And the reason I say that is, I guess, so let's see.
00:06:13.320 | So when we started Baidu Research, we set our team that way.
00:06:16.560 | Other teams are also organized this way.
00:06:18.280 | I think Peter mentioned to me that OpenAI also has a systems team and
00:06:22.040 | a machine learning team.
00:06:23.720 | And the reason we're starting to organize our teams that way, I think,
00:06:26.120 | is that some of the computer systems work we do, right?
00:06:29.480 | So we have an HPC team, a high-performance computing team,
00:06:31.800 | a supercomputing team at Baidu.
00:06:34.160 | Some of the extremely specialized knowledge in HPC is just incredibly difficult for
00:06:38.920 | an AI researcher to learn, right?
00:06:40.680 | Some people are super smart.
00:06:41.720 | Maybe Jeff Dean is smart enough to learn everything.
00:06:44.080 | But it's just difficult for any one human to be sufficiently expert in HPC and
00:06:49.000 | sufficiently expert in machine learning.
00:06:52.760 | And so we've been finding, and Shubho actually,
00:06:56.160 | one of the co-organizers on our HPC team, we've been finding that bringing
00:07:01.160 | knowledge from these multiple sources, multiple communities,
00:07:05.160 | allows us to get our best performance.
00:07:07.560 | You've heard a lot of fantastic presentations today.
00:07:12.320 | I want to draw one other picture, which is, in my mind,
00:07:16.160 | this is how I mentally bucket work in deep learning.
00:07:20.320 | So this might be a useful categorization, right?
00:07:22.600 | When you look at the talk,
00:07:23.520 | you can mentally put each talk into one of these buckets I'm about to draw.
00:07:27.640 | But I feel like there's a lot of work on, I'm gonna call, you know,
00:07:31.400 | general DL, general models.
00:07:33.160 | And this is basically the type of model that Hugo Larochelle talked about
00:07:36.640 | this morning, where you have, you know, really densely connected layers, right?
00:07:42.240 | I guess FC, right, was the- so there's a huge bucket of models there.
00:07:47.800 | And then I think a second bucket is sequence models.
00:07:51.400 | So 1D sequences, and this is where I would bucket a lot of the work on RNNs,
00:08:00.800 | you know, LSTMs, right, GRUs, some of the attention models,
00:08:05.760 | which I guess probably Yoshua Bengio will talk about tomorrow, or maybe others,
00:08:09.360 | maybe Quoc, I'm not sure, right?
00:08:11.480 | But so the 1D sequence models is another huge bucket.
00:08:14.840 | And the third bucket is the image models.
00:08:17.720 | This is really 2D and maybe sometimes 3D, but this is where I would tend to
00:08:22.080 | bucket all the work of CNNs, convolutional nets.
00:08:26.480 | And then in my mental bucket, then there's a fourth one, which is the other, right?
00:08:31.320 | And this includes unsupervised learning, you know,
00:08:34.760 | the reinforcement learning, right, as well as lots of other creative ideas,
00:08:40.280 | being explored in the training.
00:08:41.880 | You know, like, what's, I still find slow feature analysis,
00:08:45.040 | sparse coding, ICA, various models kind of in the other category, super exciting.
00:08:52.080 | So it turns out that if you look across industry today,
00:08:55.600 | almost all the value today is driven by these three buckets, right?
00:09:03.040 | So what I mean is those three buckets of algorithms are driving,
00:09:08.280 | causing us to have much better products, right, or monetizing very well.
00:09:12.280 | It's just incredibly useful for lots of things.
00:09:15.480 | In some ways, I think this bucket might be the future of AI, right?
00:09:18.640 | So I find unsupervised learning especially super exciting.
00:09:21.600 | So I'm actually super excited about this as well.
00:09:25.240 | Although I think that if on Monday you have a job and you're trying to build
00:09:29.520 | a product or whatever, the chance of you using something from one of these three
00:09:32.760 | buckets will be highest.
00:09:35.040 | But I definitely encourage you to contribute to research here as well, right?
00:09:39.320 | So I said that trend one, the major trend one of deep learning is scale.
00:09:47.840 | This is what I would say is maybe major trend two, of two of two trends.
00:09:52.680 | This is not gonna go on forever, right?
00:09:55.080 | I feel major trend two is the rise of end-to-end deep learning,
00:10:02.040 | especially for rich outputs.
00:10:05.280 | And so end-to-end deep learning,
00:10:07.480 | I'll say a little bit more in a second exactly what I mean by that.
00:10:10.120 | But the examples I'm gonna talk about are all from one of these three buckets,
00:10:13.040 | right, general DL, sequence models, image 2D, 3D models.
00:10:16.960 | But let's see, it's best illustrated with a few examples.
00:10:21.400 | Until recently, a lot of machine learning used to output just real numbers.
00:10:26.240 | So I guess in Richard's example, you have a movie review, right?
00:10:32.000 | And then, actually, I prepared totally different examples.
00:10:34.760 | I was editing my examples earlier to be more coherent with the speakers before me.
00:10:39.840 | We have a movie review and then output the sentiment.
00:10:42.920 | Is this a positive or a negative movie review?
00:10:45.880 | Or you might have an image, right?
00:10:48.400 | And then you want to do image net object recognition.
00:10:51.840 | So this would be a 0, 1 output.
00:10:53.760 | This might be an integer from 1 to 1,000.
00:10:56.560 | But so until recently,
00:10:57.680 | a lot of machine learning was about outputting a single number,
00:11:00.400 | maybe a real number, maybe an integer.
00:11:02.400 | Um, and I think the
00:11:04.440 | number two major trend that I'm really excited about is, um,
00:11:08.440 | end-to-end deep learning algorithms that can output much more complex things than numbers.
00:11:13.280 | And so one example that you've seen is, uh,
00:11:16.120 | image captioning where instead of taking an image and saying this is a cat,
00:11:20.200 | you can now take an image and output, you know,
00:11:22.880 | an entire string of text using RNN to generate that sequence.
00:11:26.680 | So I guess, uh,
00:11:28.600 | Andrej who spoke just now,
00:11:30.880 | I think, uh, Oriol Vinyals, uh,
00:11:32.960 | uh, Wei Xu at Baidu, right?
00:11:34.520 | A whole bunch of people have worked on this problem.
00:11:37.160 | Um, one of the things that I guess, uh, my, my,
00:11:41.080 | my collaborator Adam Coates will talk about tomorrow, uh,
00:11:44.080 | maybe Kwok as well, not sure,
00:11:45.680 | is, um, speech recognition where you take as input audio and you directly output,
00:11:51.080 | you know, the text transcript, right?
00:11:54.640 | And so, um, when we first proposed using this kind of
00:11:58.320 | end-to-end architecture to do speech recognition, this was very controversial.
00:12:02.200 | We were building on the work of Alex Graves, uh,
00:12:04.360 | but the idea of actually putting this in the production speech system was very,
00:12:07.680 | very controversial when we first, you know,
00:12:09.800 | said we wanted to do this.
00:12:11.000 | But I think the whole community is coming around to this point of view more recently.
00:12:15.400 | Um, or, you know, machine translation,
00:12:18.760 | say go from English to French, right?
00:12:20.640 | So, uh, Ilya Sutskever, Quoc Le, uh,
00:12:22.880 | others, uh, working on this,
00:12:24.840 | a lot of teams now, um, or, you know,
00:12:27.440 | given the parameters, um,
00:12:30.840 | synthesize a brand new image, right?
00:12:33.600 | And, and, and you saw some examples of image synthesis.
00:12:36.320 | So I feel like the
00:12:38.280 | second major trend of
00:12:40.440 | deep learning that I find very exciting and, and, I mean,
00:12:44.040 | that is allowing us to build, you know,
00:12:45.480 | transformative things that we just couldn't build three or four years ago,
00:12:48.440 | has been this trend toward learning algorithms that output
00:12:52.480 | not just a number,
00:12:53.600 | but can output very complicated things like a sentence or a caption or a
00:12:57.720 | French sentence or an image or,
00:13:00.000 | like the recent WaveNet paper, output audio, right?
00:13:02.640 | So I think this is a, maybe the second,
00:13:04.840 | um, major trend.
00:13:07.560 | So, um, despite all the excitement,
00:13:14.280 | um, about end-to-end deep learning, um,
00:13:17.920 | I think that end-to-end deep learning, you know,
00:13:20.360 | sadly is not the solution to everything.
00:13:22.360 | Um, I wanna give you some rules of thumb for
00:13:25.480 | what exactly end-to-end deep learning is, and when to use it and when not to use it.
00:13:28.360 | So, so I'm moving to the second bullet and we'll, we'll, we'll go through these.
00:13:32.760 | So the trend towards end-to-end deep learning has been, um,
00:13:40.760 | this idea that instead of engineering a lot of intermediate representations,
00:13:45.840 | maybe you can go directly from your raw input to whatever you wanna predict, right?
00:13:51.520 | So for example, actually it's a tick because I'm,
00:13:53.520 | I'm gonna use speech as a recurring example.
00:13:55.640 | Uh, so for speech recognition,
00:13:58.360 | um, previously one used to go from the audio to, you know,
00:14:03.920 | hand-engineered features like MFCCs or something and then maybe extract phonemes,
00:14:09.120 | right? Um, and then eventually you try to generate the transcript.
00:14:14.720 | Um, for those of you that aren't sure what a phoneme is.
00:14:18.320 | So, uh, if you look at the word,
00:14:20.200 | listen to the word cat and the word kick,
00:14:22.720 | the k sound, right, is the same sound.
00:14:25.440 | And so phonemes are these
00:14:27.280 | basic units of sound, such as k,
00:14:31.040 | uh, and are, um, hypothesized by linguists to be the basic units of sound.
00:14:35.040 | So k, a, t would be the,
00:14:37.120 | maybe the three phonemes that make up the word cat, right?
00:14:40.040 | So traditional speech systems used to,
00:14:42.160 | used to do this, uh,
00:14:43.640 | and I think in 2011 Li Deng and Geoff Hinton, um,
00:14:46.640 | made a lot of progress in speech recognition by saying we can use deep learning to do this first step.
00:14:51.880 | Um, but the end-to-end approach to this would be to say,
00:14:55.440 | let's forget about phonemes,
00:14:57.320 | let's just have a neural net,
00:15:00.720 | right, input the audio and output the transcript.
00:15:04.520 | Um, so it turns out that in some problems,
00:15:09.200 | there's this end-to-end approach.
00:15:10.440 | So one end is the input,
00:15:11.800 | the other end is the output.
00:15:13.080 | So the phrase end-to-end deep learning refers to, uh,
00:15:15.680 | just having a neural net or, you know,
00:15:17.400 | like a learning algorithm directly go from input to output.
00:15:20.040 | That's, that's what end-to-end means.
00:15:21.440 | Um, this end-to-end formula, uh,
00:15:24.160 | is, I think it makes for, what,
00:15:26.080 | great PR, uh, and,
00:15:27.680 | and it's actually very simple but it only works sometimes.
00:15:30.960 | Um, and actually, maybe, maybe,
00:15:32.720 | yeah, I'll just tell you this interesting story.
00:15:34.800 | You know, this end-to-end story really upset a lot of people.
00:15:37.920 | Um, when we were doing this work,
00:15:40.000 | I guess, I used to go around and say,
00:15:42.240 | I think phonemes are a fantasy of linguists.
00:15:45.200 | Um, and we should do away with them.
00:15:47.200 | And I still remember there was a meeting at Stanford,
00:15:49.480 | some of you know who it was,
00:15:50.480 | there was a linguist kind of yelling at me in public, uh, for saying that.
00:15:53.960 | So maybe, maybe I should not,
00:15:55.720 | uh, but we turned out to be right, you know, so.
00:15:57.800 | [LAUGHTER]
00:16:02.000 | All right. Um, so let's see.
00:16:06.080 | Um, but the, the,
00:16:08.160 | the Achilles heel of a lot of deep learning is that you need tons of labeled data, right?
00:16:13.080 | So if, if this is your x and that's your y,
00:16:15.920 | then for end-to-end deep learning to work,
00:16:17.840 | you need a ton of labeled,
00:16:19.600 | you know, input-output data, x, y.
00:16:22.360 | So to take an example where,
00:16:25.240 | um, uh, where, you know,
00:16:26.600 | one may or may not consider end-to-end deep learning.
00:16:28.640 | Um, this is a problem I learned about just last week from Curtis Langlotz and,
00:16:33.000 | and Darwin who's in the audience, I think,
00:16:35.000 | of, uh, imagine you want to use, um,
00:16:37.880 | x-ray pictures of a child's hand in order to predict the child's age, right?
00:16:41.080 | So this is a real thing, you know,
00:16:42.240 | doctors actually care to look at an x-ray of your,
00:16:45.120 | of a child's hand in order to predict the, the age of the child.
00:16:48.120 | So, um, boy, let me draw an x-ray image, right?
00:16:51.440 | So this is, you know,
00:16:53.160 | the child's hand. So these are the bones, right?
00:16:56.320 | I guess. This is why I'm not a doctor.
00:17:00.000 | Okay. So that's a hand and, and, and you see the bones.
00:17:03.240 | Um, and so a more traditional algorithm might input an image,
00:17:08.680 | and then first, you know,
00:17:10.920 | extract the bones.
00:17:12.160 | So first figure out, oh,
00:17:14.000 | there's a bone here, there's a bone here,
00:17:16.320 | there's a bone here, and then maybe measure the length of these bones, right?
00:17:21.120 | Um, so really I'm gonna say bone lengths,
00:17:25.520 | and then maybe have some formula,
00:17:28.280 | like some linear regression, average,
00:17:30.040 | some simple thing to go from the bone length to estimate the age of the child, right?
00:17:34.640 | So this is a non-end-to-end approach to try and solve this problem.
00:17:37.960 | Um, an end-to-end approach would be to take an image,
00:17:40.640 | and then, you know,
00:17:42.160 | run a convnet or whatever,
00:17:43.760 | and just try to output the age of the child.
00:17:45.800 | And I think this is one example of a problem where, um,
00:17:49.560 | it's very challenging to get end-to-end deep learning to work,
00:17:52.880 | because you just don't have enough data.
00:17:55.000 | You just don't have enough X-rays of children's hands annotated with their ages.
00:17:59.560 | And instead, where we see deep learning coming in is in this step, right?
00:18:05.600 | To go from the image to,
00:18:07.240 | to figure out where the bones are,
00:18:08.520 | use deep learning for that.
00:18:09.840 | But the advantage of this non-end-to-end architecture is it
00:18:13.600 | allows you to hand engineer in more information about the system,
00:18:17.280 | such as how bone lengths map to age, right?
00:18:19.840 | Which- which you can kind of get tables about.
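A rough sketch of this non-end-to-end pipeline, not the actual system: the bone-length extractor is left as a stub, and the numbers in the regression step are made up for illustration.

```python
# Sketch of the two-stage pipeline: (1) a learned component that finds bones
# in the X-ray, (2) a simple hand-engineered regression from bone lengths to
# age. The stub and the synthetic numbers are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

def extract_bone_lengths(xray_image):
    """Stage 1 (stub): in practice a convnet trained to localize bones,
    which needs images annotated with bone positions rather than ages."""
    raise NotImplementedError

# Stage 2: fit bone lengths -> age from a small table of measurements
# (here faked), which encodes the kind of domain knowledge tables provide.
rng = np.random.default_rng(0)
bone_lengths = rng.uniform(3.0, 8.0, size=(200, 3))            # cm, 3 bones
ages = 1.5 * bone_lengths.sum(axis=1) + rng.normal(0, 1, 200)  # fake relation

age_from_lengths = LinearRegression().fit(bone_lengths, ages)
print(age_from_lengths.predict([[4.2, 5.1, 6.0]]))             # estimated age
```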
00:18:21.760 | Um, there are a lot of examples like this.
00:18:24.400 | And I think one of the unfortunate things about deep learning is that, um, let's see.
00:18:29.880 | Uh, you know, for suitably sexy values of X and Y,
00:18:35.640 | you could almost always train a model and publish a paper.
00:18:39.360 | But that doesn't always mean that,
00:18:41.160 | you know, it's actually a good idea. Peter?
00:18:43.120 | [inaudible]
00:18:51.960 | I see. Yeah. I see.
00:18:54.360 | Yeah. I see.
00:18:56.360 | Yes, that's true. Yes.
00:18:57.560 | So Peter's pointing out that in practice,
00:18:59.640 | you could, um, uh,
00:19:01.240 | if this is a fixed function F, right?
00:19:03.040 | You could backprop all the way from the age,
00:19:05.120 | all the way back to the image.
00:19:06.480 | Yeah, that's a good idea actually.
00:19:07.920 | Um, who was it just now who said you better do it quickly?
00:19:10.320 | [LAUGHTER]
00:19:12.360 | Yeah. Um, let me give a couple of other examples, uh, uh,
00:19:15.600 | that where- where it might be harder to backprop all the way through, right?
00:19:18.280 | So here's- here's an example.
00:19:19.640 | Um, take self-driving cars.
00:19:21.440 | You know, most teams are using an architecture where you input an image,
00:19:25.160 | you know, what's in front of the car, let's say.
00:19:27.040 | And then you, you know,
00:19:28.320 | detect other cars, right?
00:19:30.160 | Uh, and then- and- and maybe use the image to detect pedestrians, right?
00:19:35.440 | Self-driving cars are obviously more complex than this, right?
00:19:38.000 | Uh, but then now that you know where the other cars and where
00:19:40.320 | the pedestrians are relative to your car,
00:19:42.200 | you then have a planning algorithm, uh, uh,
00:19:45.400 | to then, you know, come up with a trajectory, right?
00:19:50.560 | And then now that you know, um,
00:19:53.100 | what's the trajectory that you want your car to drive through, um,
00:19:58.320 | you could then, you know,
00:19:59.680 | compute the steering direction, right?
00:20:02.800 | Let's say. And so, um,
00:20:08.240 | this is actually the architecture that most self-driving car teams are using.
00:20:11.920 | Um, and you know,
00:20:13.800 | there have been interesting approaches to- to say,
00:20:16.880 | well, I'm gonna input an image and I'll put a steering direction, right?
00:20:22.000 | And I think this is an example of where, um,
00:20:25.400 | at least with today's state of technology,
00:20:27.760 | I'd be very cautious about this second approach because- and I think,
00:20:31.520 | if you have enough data,
00:20:33.120 | the second approach will work and you can even prove a theorem,
00:20:35.880 | you know, showing that it will work, I think.
00:20:37.400 | But, um, I don't know that anyone today has
00:20:40.280 | enough data to make the second approach really, really work well, right?
00:20:44.040 | And- and I think kind of the- the- Peter made a great comment just now.
00:20:46.720 | And I think, you know, some of these components will be incredibly complicated,
00:20:50.800 | you know, like this could be a path planner with an explicit search.
00:20:53.640 | And you could actually design a really complicated path planner and generate the trajectory.
00:20:58.140 | And your ability to hand code that still has a lot of value, right?
00:21:02.540 | So this is one thing to watch out for.
00:21:04.620 | Um, I have seen project teams say,
00:21:07.100 | I can get X, I can get Y,
00:21:08.920 | I'm gonna train deep learning.
00:21:10.620 | Um, but unless you actually have the data, you know,
00:21:13.620 | some of these things make for great demos if- if you cherry pick the examples.
00:21:17.440 | But- but it can be challenging to, um, get to work at scale.
00:21:20.740 | I- I should say, for self-driving cars,
00:21:22.380 | this debate is still open.
00:21:23.520 | I'm- I'm cautious about this.
00:21:24.980 | I don't think it's a- I don't think this will necessarily fail.
00:21:27.900 | I just think the data needed to do this will be- will be really immense.
00:21:31.620 | So I- I'd be very cautious about it right now.
00:21:35.220 | But it might work if you have enough data.
00:21:37.100 | Um, so, you know,
00:21:40.380 | one of the themes that comes up in machine learning,
00:21:44.140 | or really if you're working on a machine learning project,
00:21:46.660 | one thing that will often come up is, um,
00:21:49.420 | you will, you know,
00:21:51.980 | develop a learning system, uh,
00:21:54.420 | train it, maybe it doesn't work as well as you're hoping yet.
00:21:57.360 | And the question is, what do you do next, right?
00:21:59.860 | This is a very common part of a machine learning, you know,
00:22:02.620 | a researcher or a machine learning engineer's life,
00:22:04.620 | which is, you know, you- you- you train a model,
00:22:07.020 | doesn't do what you want it to yet,
00:22:08.540 | so what do you do next, right?
00:22:09.860 | This happens to us all the time.
00:22:11.140 | Um, and you face a lot of choices.
00:22:12.980 | You could collect more data,
00:22:14.260 | maybe you want to train it longer,
00:22:15.700 | maybe you want a different neural network architecture,
00:22:17.900 | maybe you want to try regularization,
00:22:19.380 | maybe you want a bigger model,
00:22:20.900 | or run some more GPUs.
00:22:22.020 | So you have a lot of decisions.
00:22:23.460 | And I think that, um,
00:22:24.980 | a lot of the skill of a machine learning researcher or
00:22:27.320 | a machine learning engineer is knowing how to make these decisions, right?
00:22:30.640 | And- and- and the difference in performance and whether you, you know,
00:22:33.720 | do you train a bigger model or do you try regularization,
00:22:36.960 | your skill at picking these decisions will have a huge impact on how
00:22:40.160 | rapidly, um, uh, you can make progress on an actual machine learning problem.
00:22:44.800 | So, um, I want to talk a little about bias and variance,
00:22:49.000 | since that's one of the most basic,
00:22:50.560 | you know, concepts in machine learning.
00:22:52.400 | And I feel like it's evolving slightly in the era of- of- of deep learning.
00:22:56.500 | So to use a- as a motivating example, um,
00:22:59.260 | let's say the goal is to build a human level,
00:23:04.020 | right, uh, speech system, right?
00:23:09.440 | Speech recognition system, okay?
00:23:12.240 | So, um, what we would typically do,
00:23:15.880 | especially in academia, is we'll get a dataset, you know,
00:23:19.040 | here's my dataset with a lot of examples,
00:23:20.880 | and then we shuffle it and we randomly split it into 70-30 trained tests,
00:23:25.120 | or maybe- or maybe 70% trained,
00:23:27.880 | you know, 15% dev,
00:23:29.840 | and, uh, 15% test, right?
00:23:31.720 | We- we take- oh, and, uh,
00:23:33.780 | some people use the term validation set,
00:23:35.840 | but I'm- I'm just gonna use the term dev set,
00:23:38.080 | it stands for development set,
00:23:39.500 | and means the same thing as validation set.
00:23:41.000 | Okay, so it's pretty common.
00:23:42.400 | Um, and so what we would- what- what- what I encourage you to do if you aren't already,
00:23:48.920 | is to measure the following things.
00:23:52.280 | Um, human level error.
00:23:55.320 | So let- let- actually let me illustrate an example.
00:23:58.480 | Let's say that on your dev- uh, uh,
00:24:01.280 | let's say that, um, on your dev set,
00:24:03.900 | you know, human level error is, uh, 1% error.
00:24:07.560 | Um, let's say that your training set error is,
00:24:14.280 | um, let me use 5%,
00:24:16.880 | and let's say that your dev set error,
00:24:20.040 | really, right, dev set is a proxy for test set except you tune to the dev set, right?
00:24:25.720 | Is, um, you know, 6% error.
00:24:28.760 | Okay. So this is one of the most basics- this- this is really a,
00:24:33.920 | a step in developing a learning algorithm that I encourage you to do if you aren't already,
00:24:37.800 | to figure out what are these three numbers.
00:24:39.740 | Because these three numbers, um,
00:24:42.080 | really helps in terms of telling you what to do next.
00:24:45.080 | So in this example, um,
00:24:46.880 | you see that you're doing much worse than human level performance.
00:24:49.640 | Um, and so you see that there's a huge gap here from 1% to 5%.
00:24:53.760 | And I'm gonna call this, you know,
00:24:56.080 | right, the bias of your learning algorithm.
00:24:58.640 | Um, and for the statisticians in the room,
00:25:00.920 | I'm using the terms bias and variance informally and
00:25:03.280 | doesn't correspond exactly to the way they're defined in textbooks.
00:25:06.200 | But I find these useful concepts for- for- for deciding how to make progress on your problem.
00:25:11.280 | Um, and so I would say that, you know,
00:25:14.120 | in this example, you have a high bias classifier.
00:25:16.440 | You try training a bigger model,
00:25:17.800 | maybe try training longer.
00:25:18.920 | We'll come- come back to this in a second.
00:25:20.560 | Um, for a different example,
00:25:23.520 | you know, so this is one example.
00:25:25.240 | Uh, for a different example,
00:25:26.580 | if human level error is 1% and, uh,
00:25:30.360 | training set error were 2%, right,
00:25:33.240 | and dev set error was 6%,
00:25:35.460 | then, you know, you really have a high, what,
00:25:38.180 | variance problem, right, like an overfitting problem.
00:25:41.680 | And this tells you- this really tells you what to do, what to try, right?
00:25:45.840 | Try adding regularization or try, um, uh,
00:25:49.240 | or try early stopping or, um,
00:25:51.280 | or even better, we get more data, right?
00:25:54.520 | Um, and then there's also really a third case which is if you have,
00:25:59.360 | uh, 1% human level error, um,
00:26:03.760 | I'm gonna say 6% dev set error.
00:26:07.360 | Oh, actually, I'm gonna say 5% dev set error,
00:26:10.400 | and, uh, 10%, um, excuse me,
00:26:13.400 | 5% training error and 10% dev set error.
00:26:16.640 | And in this case, you have high bias and high variance, right?
00:26:20.120 | Um, so- so I guess, yeah,
00:26:23.960 | high bias and high variance, you know, like sucks for you, right?
00:26:26.840 | Um, so I feel like that when I talk to applied machine learning teams,
00:26:32.600 | there's one really simple workflow, um,
00:26:37.840 | that is enough to help you make a lot of
00:26:41.200 | decisions about what you should be doing on your machine learning application.
00:26:45.400 | Um, and by- if- if you're wondering why I'm talking about this and what this has to do with deep learning,
00:26:50.280 | I'll come back to this in a second, right?
00:26:51.640 | Does this change in error deep learning?
00:26:53.280 | But, uh, uh, I feel like there's this, you know,
00:26:55.560 | almost a workflow, like almost a- a flow chart, right?
00:26:58.360 | Which is first ask yourself, um,
00:27:02.000 | is your training error high?
00:27:06.480 | Oh, and I hope I'm writing big enough that people can see it.
00:27:10.040 | If- if you have trouble reading,
00:27:11.600 | let me know and I'll- and I'll read it back out, right?
00:27:14.240 | But first I'll ask, you know,
00:27:15.600 | are you even doing well in your training set?
00:27:17.560 | Um, and- and- and if your training error is high,
00:27:20.880 | then you know you have high bias.
00:27:23.040 | And so your standard tactics like train a bigger model,
00:27:26.560 | just train a bigger neural network, um,
00:27:29.080 | or maybe try training longer, you know,
00:27:31.640 | make sure that your- your optimization algorithm is- is doing a good enough job.
00:27:35.280 | Um, and then there's also this magical one which is a new model architecture,
00:27:38.880 | which is a hard one, right?
00:27:40.840 | Um, come back to that in a second, okay?
00:27:43.880 | And then you kind of keep doing that until you're doing well at least on your training set.
00:27:49.520 | Once you're at least doing well on your training set,
00:27:51.840 | so your training error is no longer high.
00:27:53.480 | So no, training error is not unacceptably high.
00:27:56.160 | Um, we then ask, you know,
00:27:58.320 | is your dev error high, right?
00:28:03.360 | And if the answer is yes, then, um, well,
00:28:08.800 | if your dev set error is high,
00:28:10.400 | then you have a high variance problem,
00:28:12.600 | you have an overfitting problem.
00:28:14.120 | And so, you know, the solutions are try to get more data,
00:28:19.200 | right, or add regularization,
00:28:22.880 | or try a new model architecture, right?
00:28:33.280 | And then- and- and you kind of keep doing this until your,
00:28:38.400 | uh, dev set error is no longer high- I guess
00:28:42.520 | until both you're doing well on your training set and on your dev set.
00:28:45.960 | And then, you know, hopefully, right, you're done.
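A minimal sketch of this flow chart as code; the error numbers would come from your own evaluation, and the gap threshold is an arbitrary placeholder.

```python
# Sketch of the bias/variance workflow: compare training and dev error against
# a human-level baseline and suggest which knobs to turn. Threshold is arbitrary.
def next_steps(human_err, train_err, dev_err, tol=0.005):
    advice = []
    if train_err - human_err > tol:       # training error unacceptably high
        advice.append("high bias: bigger model, train longer, new architecture")
    if dev_err - train_err > tol:         # dev error much worse than training
        advice.append("high variance: more data, regularization, new architecture")
    return advice or ["doing well on train and dev: done, hopefully"]

print(next_steps(human_err=0.01, train_err=0.05, dev_err=0.06))  # high bias
print(next_steps(human_err=0.01, train_err=0.02, dev_err=0.06))  # high variance
print(next_steps(human_err=0.01, train_err=0.05, dev_err=0.10))  # both
```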
00:28:49.720 | So I think one of the- um,
00:28:52.760 | one of the nice things about this era of deep learning is that no matter- it's kind of,
00:28:57.840 | no matter where you're stuck with modern deep learning tools,
00:29:01.540 | we have a clear path for making progress in a way that was not true,
00:29:05.300 | or at least was much less true in the era before deep learning,
00:29:08.300 | which is in particular,
00:29:09.860 | no matter what your problem is,
00:29:11.600 | overfitting or underfitting, uh,
00:29:13.100 | really high bias or high variance or maybe both, right?
00:29:15.620 | You always have at least one action you can take,
00:29:18.380 | which is bigger model or more data.
00:29:20.860 | So you could- so- so- so in the deep learning era,
00:29:23.940 | relative to say the logistic regression era,
00:29:26.100 | the SVM era, it feels like we more often have
00:29:29.060 | a way out of whatever problem we're stuck in.
00:29:31.820 | Um, and so I feel like these days,
00:29:33.500 | people talk less about bias-variance trade-off.
00:29:35.960 | You might have heard that term, bias-variance trade-off,
00:29:38.060 | underfitting versus overfitting.
00:29:40.020 | And the reason we talked a lot about that in the past was
00:29:42.440 | because a lot of the moves available to us like tuning regularization,
00:29:46.700 | that really traded off bias and variance.
00:29:48.900 | So it was like a, you know,
00:29:50.060 | zero-sum thing, right?
00:29:51.140 | And you- you could improve one,
00:29:53.000 | but that makes the other one worse.
00:29:54.460 | But in the era of deep learning,
00:29:56.020 | really one of the reasons I think deep learning has been so powerful,
00:29:58.320 | is that the coupling between bias and variance can be weaker.
00:30:02.380 | And we now have tools,
00:30:03.820 | we now have better tools to, you know,
00:30:05.940 | reduce bias without increasing variance or reduce variance without increasing bias.
00:30:09.260 | And really the bigger- the- the- the big one is really,
00:30:11.520 | you can always train a bigger model,
00:30:13.140 | bigger neural network, in a way that was harder to do when you were
00:30:16.380 | training logistic regression, where the move was to come up with more and more features, right?
00:30:19.340 | So that was just harder to do.
00:30:21.180 | Um, so let's see.
00:30:26.340 | One of the- and I'm gonna add more to this diagram at the bottom in a second, okay?
00:30:32.200 | Um, one of the effects of this,
00:30:39.000 | maybe this- and- and- and by the way,
00:30:40.960 | I've been surprised, I mean, honestly,
00:30:42.920 | um, this new model architecture,
00:30:45.760 | that's really hard, right?
00:30:46.840 | It takes a lot of experience,
00:30:47.840 | but- but even if you aren't super experienced with,
00:30:50.720 | you know, a variety of deep learning models,
00:30:52.640 | the things in the blue boxes,
00:30:53.800 | you can often do those and that will drive a lot of progress, right?
00:30:57.080 | But if you have experience with, you know,
00:30:59.060 | how to tune a convnet versus a ResNet versus whatever,
00:31:01.940 | by all means, try those things as well.
00:31:03.580 | Definitely encourage you to keep mastering those as well.
00:31:05.600 | But this dumb formula of
00:31:07.980 | bigger model, more data is enough to do very well on a lot of problems.
00:31:13.820 | So, um, let's see.
00:31:18.220 | Oh, so bigger model puts pressure on, you know,
00:31:22.420 | systems which is why we- we have high-performance computing team.
00:31:26.220 | Um, more data has led to another interesting,
00:31:30.540 | um, set of investments.
00:31:32.340 | So, uh, with, you know,
00:31:34.020 | I guess a lot of us have always noted
00:31:36.540 | that deep learning has this insatiable hunger for data,
00:31:38.420 | we use, you know,
00:31:39.340 | crowdsourcing for labeling, um, uh,
00:31:41.860 | we try to come up with all sorts of clever ways to come- to- to- to get data.
00:31:45.940 | Um, one- one area that- that I'm seeing more and more activity in, right?
00:31:50.460 | It feels a little bit nascent,
00:31:51.860 | but I'm seeing a lot of activity in is,
00:31:53.660 | um, automatic data synthesis, right?
00:31:56.460 | Um, let's see. And so here's what I mean.
00:32:01.180 | You know, once upon a time,
00:32:03.060 | people used to hand engineer features,
00:32:04.860 | and there was a lot of skill in hand engineering the features of, you know,
00:32:07.980 | like SIFT or HOG or whatever to feed into an SVM.
00:32:11.540 | Um, automatic data synthesis is this little area that is small,
00:32:16.940 | but feels like it's growing,
00:32:18.220 | where there is some hand engineering needed,
00:32:20.340 | but where I'm seeing quite a lot of progress in multiple problems enabled by
00:32:24.380 | hand engineering, uh, synthetic data in order to feed into the giant maw of your neural network.
00:32:30.540 | All right. So let me- let me best illustrate it with a couple of examples.
00:32:33.340 | Um, one of the easy ones is, uh, OCR.
00:32:36.140 | So- so let's say you want to train a, um,
00:32:37.940 | optical character recognition system,
00:32:39.660 | and actually I've been surprised at Baidu.
00:32:41.260 | This has tons of users, actually.
00:32:42.940 | This is one of my most useful APIs at Baidu, right?
00:32:45.860 | Um, if you imagine firing up Microsoft Word,
00:32:49.860 | um, and downloading a random picture off the Internet,
00:32:53.300 | then choose a random Microsoft Word font,
00:32:55.700 | choose a random word in the English dictionary,
00:32:58.260 | and just type the English word into Microsoft Word in a random font,
00:33:02.580 | and paste that on top, you know,
00:33:04.340 | like a transparent background on top of a random image off the Internet,
00:33:07.500 | then you just synthesize a training example for OCR, right?
00:33:11.020 | Um, and so this gives you access to essentially unlimited amounts of data.
00:33:15.100 | It turns out that the simple idea I just described won't work in its natural form.
00:33:19.300 | You actually need to do a lot of tuning to blur the synthesized text of the background,
00:33:24.340 | to make sure the color contrast matches your training distribution.
00:33:27.460 | So we found in practice it can be a lot of work to fine-tune how you synthesize data.
00:33:31.620 | But I've seen in many verticals,
00:33:34.180 | um, and I'll give a few examples.
00:33:35.860 | If you do that engineering work,
00:33:37.700 | and sadly it's painful engineering,
00:33:39.180 | you could actually get a lot of progress.
00:33:40.980 | Actually, actually Tao Wang, uh,
00:33:42.620 | who was a student here at Stanford, um, uh,
00:33:45.860 | the effect I saw was he engineered this for
00:33:48.340 | months with very little progress and then suddenly he got the parameters right,
00:33:52.620 | and he had huge amounts of data and was able to
00:33:55.180 | build one of the best OCR systems in the world at that time, right?
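A rough sketch of that synthesis loop with Pillow; the word list, font paths and background files are placeholders, and as noted the real work is tuning blur and contrast so the synthetic images match the real distribution.

```python
# Sketch of OCR data synthesis: render a random word in a random font onto a
# random background, then blur slightly. Paths, word list and parameter ranges
# are placeholders.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

words = ["cat", "mirror", "gradient", "baidu"]               # stand-in dictionary
fonts = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]  # stand-in fonts
backgrounds = ["background1.jpg", "background2.jpg"]         # stand-in images

def synthesize_example():
    word = random.choice(words)
    font = ImageFont.truetype(random.choice(fonts), size=random.randint(20, 48))
    bg = Image.open(random.choice(backgrounds)).convert("RGB").resize((256, 64))
    draw = ImageDraw.Draw(bg)
    draw.text((random.randint(0, 60), random.randint(0, 20)), word,
              font=font, fill=(random.randint(0, 80),) * 3)
    bg = bg.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))
    return bg, word   # (image, label) training example
```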
00:33:58.780 | Um, other examples, speech recognition, right?
00:34:04.980 | One of the most powerful ideas, uh,
00:34:07.140 | for building a, uh, effective speech system is if you take clean audio, you know,
00:34:11.740 | this is like a clean relatively noiseless audio,
00:34:14.900 | and take random background sounds and just synthesize what
00:34:18.820 | that person's voice would sound like in the presence of that background noise, right?
00:34:23.700 | And this turns out to work remarkably well.
00:34:25.260 | So if you record a lot of car noise,
00:34:26.620 | what the inside of your car sounds like,
00:34:28.380 | and record a lot of clean audio of someone speaking in a quiet environment,
00:34:32.500 | um, the mathematical operation is actually addition,
00:34:35.140 | the superposition of sound,
00:34:36.300 | but you basically add the two waveforms together,
00:34:38.660 | and then you get an audio clip that sounds like that person talking in the car,
00:34:41.980 | and you feed this to your learning algorithm.
00:34:43.580 | And so this has a dramatic effect in,
00:34:46.300 | in terms of amplifying the training set for speech recognition,
00:34:49.060 | and we found it can have a huge effect on, um, performance.
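The superposition idea is literally adding waveforms; a minimal sketch, assuming the soundfile package is available and using placeholder file names and gains.

```python
# Sketch of speech data synthesis by superposition: add car noise to clean
# speech at a random gain; the transcript label stays the same. File names
# and gain range are placeholders.
import numpy as np
import soundfile as sf

speech, sr = sf.read("clean_speech.wav")   # quiet-room recording
noise, _ = sf.read("car_noise.wav")        # in-car background noise

noise = np.resize(noise, speech.shape)     # loop/trim noise to match length
gain = np.random.uniform(0.1, 0.5)         # how loud the car is
noisy = speech + gain * noise              # superposition of the two waveforms

sf.write("speech_in_car.wav", noisy, sr)   # new training example
```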
00:34:54.340 | Um, and then also NLP.
00:34:55.820 | You know, here's, here's one example actually done by, uh,
00:34:58.260 | some Stanford students which is, um,
00:35:00.340 | using end-to-end deep learning to do grammar correction.
00:35:03.260 | So input an ungrammatical English sentence, you know,
00:35:06.540 | maybe written by a non-native speaker, right?
00:35:08.620 | And can you automatically have a, have a,
00:35:11.020 | I guess attention RNN,
00:35:12.540 | input an ungrammatical sentence and correct the grammar,
00:35:15.180 | just edit the sentence for me.
00:35:16.860 | Um, and it turns out that you can synthesize
00:35:18.740 | huge amounts of this type of data automatically.
00:35:20.820 | And so there'll be another example where data synthesis, um, works very well.
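A toy sketch of how such pairs can be synthesized (the corruption rules here are simplistic placeholders, not the students' actual method): start from clean sentences and corrupt them programmatically, giving (ungrammatical input, corrected target) pairs for a sequence-to-sequence model.

```python
# Toy sketch of synthesizing (ungrammatical, grammatical) training pairs by
# corrupting clean text with simple rules. Real systems use richer corruptions.
import random
import re

def corrupt(sentence):
    s = sentence
    if random.random() < 0.5:
        s = re.sub(r"\b(a|an|the)\s+", "", s, count=1)   # drop an article
    if random.random() < 0.5:
        s = re.sub(r"\bis\b", "are", s, count=1)         # break agreement
    return s

clean = "The cat is sitting on the mat."
for _ in range(5):
    print(corrupt(clean), "->", clean)    # (input, target) pair
```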
00:35:25.860 | Um, and, oh, and I think, uh,
00:35:28.060 | uh, video games and RL, right?
00:35:30.380 | Really, one of the, um,
00:35:31.940 | well, let me say games broadly, right?
00:35:33.980 | One of the most powerful, um, uh,
00:35:36.580 | applications of RL, deep RL these days is video games.
00:35:39.620 | And I think if you think supervised learning has
00:35:41.940 | an insatiable hunger for data wait till you work with RL algorithms, right?
00:35:45.620 | I think the, the hunger for data is even greater.
00:35:47.700 | But when you play video games,
00:35:49.140 | the advantage of that is you can synthesize almost infinite amounts of data to,
00:35:52.820 | to feed this even greater maw, right?
00:35:55.060 | This even greater need that RL algorithms have.
00:35:57.980 | Um, so just one note of caution,
00:36:02.220 | data synthesis has a lot of limits.
00:36:04.140 | Um, I'll tell you one other story.
00:36:05.860 | Um, you know, let's say you wanna recognize cars, right?
00:36:08.980 | Uh, there are a lot of video games,
00:36:10.580 | um, I need to play more video games.
00:36:12.340 | What's a video game with cars in it?
00:36:13.580 | Oh, GTA, Grand Theft Auto, right?
00:36:15.420 | So there's a bunch of cars in Grand Theft Auto.
00:36:17.100 | Why don't we just take pictures of cars from Grand Theft Auto,
00:36:19.580 | and you can synthesize lots of cars,
00:36:21.420 | lots of orientations there,
00:36:22.780 | and paste that, give that as training data.
00:36:25.100 | Um, it turns out that's difficult to do because from the human perceptual system,
00:36:30.300 | there might be 20 cars in a game,
00:36:32.180 | but it looks great to you because you can't tell if
00:36:34.420 | there are 20 cars in the game or 1,000 cars in the game, right?
00:36:37.540 | And so there are situations where the synthetic dataset looks great to you,
00:36:41.740 | because 20 cars in a video game is plenty, it turns out.
00:36:44.580 | Uh, you don't need 100 different cars for the human to think it looks realistic.
00:36:47.860 | But from the perspective of learning algorithm,
00:36:49.700 | this is a very impoverished data,
00:36:50.980 | very, very poor dataset.
00:36:52.380 | So, so I think there's still
00:36:53.860 | a lot to be, to be sorted out for data synthesis.
00:36:56.940 | Um, for those of you that work in companies,
00:36:59.620 | one, one practice I would strongly recommend is to have a unified data warehouse, right?
00:37:06.660 | Um, so what I mean is that if your teams, if your, you know,
00:37:12.540 | engineer teams or research teams are going around trying to
00:37:14.900 | accumulate the data from lots of different organizations in your company,
00:37:18.260 | that's just going to be a pain, it's going to be slow.
00:37:20.460 | So, um, at Baidu, you know,
00:37:23.060 | our, our policy is, um,
00:37:25.060 | it's not your data, it's the company's data,
00:37:27.060 | and if it's user data,
00:37:28.300 | it goes into the user data warehouse.
00:37:30.460 | Uh, we, we, we should have a discussion about user access rights,
00:37:33.460 | privacy, and who can access what data.
00:37:35.660 | But at Baidu, I felt very strongly,
00:37:37.900 | so we mandate this data needs to come into one log- uh,
00:37:41.380 | as a logical warehouse, right?
00:37:42.900 | So it's physically distributed across lots of data centers,
00:37:45.260 | but they should be in one system,
00:37:46.660 | and what we should discuss is access rights,
00:37:49.100 | but what we should not discuss is whether or not to bring
00:37:51.700 | together data into as unified a data warehouse as possible.
00:37:55.340 | And so this is another practice that I found, um,
00:37:58.340 | makes access to data just much smoother and allows,
00:38:01.580 | you know, teams to, to, to, to drive performance.
00:38:04.060 | So really, if, if, if your boss asks you,
00:38:06.340 | tell them that I said,
00:38:07.620 | like build a unified data warehouse, right?
00:38:09.620 | [LAUGHTER]
00:38:12.980 | So, um, I wanna take the, uh,
00:38:16.740 | trained tests, you know,
00:38:18.900 | bias-variance picture and refine it.
00:38:21.580 | It turns out that this idea of a 70/30 split, right?
00:38:25.060 | Trained test or whatever, this was common in, um,
00:38:28.860 | machine learning kind of in the past when, you know, frankly,
00:38:32.260 | most of us in academia were working on relatively small data sets, right?
00:38:35.660 | And so, I don't know,
00:38:36.900 | there used to be this thing called the UC Irvine
00:38:38.980 | repository for machine learning data sets.
00:38:41.060 | You know, it was an amazing resource at the time,
00:38:44.140 | but by today's standards, it's quite small.
00:38:45.900 | And so you download the data set,
00:38:47.420 | shuffle the data set,
00:38:48.580 | and you have, you know, trained,
00:38:49.900 | dev, test, and whatever.
00:38:51.260 | Um, today, in production
00:38:54.620 | machine learning, it is much more common for
00:38:58.220 | your train and your test sets to come from different distributions, right?
00:39:02.220 | And, and this creates new problems and new ways of thinking about bias and variance.
00:39:05.980 | So let, let, let me share, talk about that.
00:39:08.060 | Um, so actually here's a concrete example,
00:39:10.260 | and this is a real example from Baidu, right?
00:39:12.300 | We built a very effective speech recognition system.
00:39:14.700 | And then recently, actually,
00:39:16.340 | actually quite some time back now,
00:39:17.820 | we wanted to launch a new product that uses speech recognition.
00:39:20.900 | Um, we wanted a speech-enabled rear view mirror, right?
00:39:24.060 | So, you know, if you have a car that doesn't have a built-in GPS unit, right?
00:39:28.020 | Uh, we wanted, and this is a real product in China,
00:39:30.580 | we want to let you take out your rear view mirror and put a new, you know,
00:39:35.060 | AI-powered speech-powered rear view mirror because it's
00:39:38.060 | an easier, uh, uh, uh, like an off-the-market installation.
00:39:41.380 | So you can speak to your rear view mirror and say,
00:39:43.240 | "Dear rear view mirror, you know,
00:39:44.940 | navigate me to whatever," right?
00:39:46.780 | So this is a real product.
00:39:48.140 | Um, so, so, so how do you build a speech recognition system
00:39:51.660 | for this in-car speech-enabled rear view mirror?
00:39:54.780 | Um, so this is our status, right?
00:39:58.380 | We have, you know, let's call it 50,000 hours of data from,
00:40:03.700 | from a speech recognition data from all sorts of places, right?
00:40:06.740 | A lot of data we bought,
00:40:07.820 | some user data, uh, uh,
00:40:09.820 | that, that we have permission to use,
00:40:11.300 | but a lot of data collected from all sorts of places,
00:40:14.260 | but not your in-car rear view mirror scenario, right?
00:40:18.060 | And then our product managers can go around and,
00:40:20.780 | you know, through quite a lot of work.
00:40:22.340 | For this example, I'm going to say,
00:40:24.260 | let's say they collect 10 hours more of data from
00:40:27.620 | exactly the rear view mirror scenario, right?
00:40:31.660 | So, you know, install this thing in the car,
00:40:33.420 | get a driver to drive around, talk to it,
00:40:34.700 | it's gonna collect 10 hours of data from
00:40:36.780 | exactly the distribution that you want to test on.
00:40:39.460 | So the question is, what do you do now, right?
00:40:42.180 | Do you throw this 50,000 hours of data away because it's
00:40:44.500 | not from the distribution one or,
00:40:45.940 | or can you use it in some way?
00:40:47.620 | Um, in the older pre-deep learning days,
00:40:51.500 | people used to build very separate models.
00:40:53.540 | So it was more common to build one speech model for a rear view mirror,
00:40:56.820 | one model for the maps voice query,
00:40:59.060 | one model for search, one model for that.
00:41:00.940 | And in the era of deep learning,
00:41:02.500 | it's becoming more and more common to just pile
00:41:04.140 | all the data into one model and let the model sort it out.
00:41:06.860 | And so long as your model is big enough,
00:41:08.780 | you could usually do this.
00:41:10.100 | And if you do little tech- if you get the features right,
00:41:12.780 | you could usually pile all the data into one model,
00:41:15.900 | uh, and often see gains,
00:41:17.780 | but certainly usually not see any losses.
00:41:20.420 | But the question is, given this dataset, you know,
00:41:23.540 | how do you split this into trained dev tests, right?
00:41:26.340 | So here's one thing you could do,
00:41:28.380 | which is call this your training set,
00:41:30.940 | call this your dev set,
00:41:32.940 | and call this your test set, right?
00:41:35.580 | Um, turns out this is a bad idea.
00:41:37.380 | I would not do this.
00:41:38.620 | And so one of the best practices
00:41:41.020 | that I'd advise is, um,
00:41:43.040 | make sure your development set and test sets are from the same distribution.
00:41:54.580 | Right. I've been finding that this is one of the tips that
00:41:57.540 | really boosts the effectiveness of a machine learning team.
00:42:00.940 | Um, so in particular, I would make this the training set,
00:42:04.780 | and then of my 10 hours,
00:42:06.620 | well, let me expand this a little bit, right?
00:42:08.500 | Much smaller dataset, maybe five hours dev, five hours of tests.
00:42:12.980 | And the reason for this is, um, uh,
00:42:16.100 | your team will be working to tune things on the dev set, right?
00:42:20.420 | And the last thing you want is if they spend three months working on the dev set,
00:42:24.540 | and then realize when they finally test it,
00:42:26.620 | that the test is totally different,
00:42:27.820 | a lot of work is wasted.
00:42:29.180 | So I think to make an analogy, you know,
00:42:31.900 | having different dev and test set distributions is a bit like if I tell you,
00:42:36.340 | "Hey, everyone, let's go north," right?
00:42:39.100 | And then a few hours later when,
00:42:40.940 | when all of you are in Oakland,
00:42:42.180 | I say, "Where are you?
00:42:43.240 | Wait, I want you to be in San Francisco."
00:42:44.900 | And you go, "What? Why did you tell me to go north?
00:42:46.660 | Tell me to go to San Francisco."
00:42:47.860 | Right. And so I think having dev and test sets be from the same distribution is one of
00:42:53.500 | the ideas that I found really optimizes the team's efficiency because it, you know,
00:42:57.860 | the development set, which is what your team is going to be tuning algorithms to,
00:43:02.060 | that is really the problem specification, right?
00:43:04.740 | And you, problem specification tells them to go here,
00:43:07.140 | but you actually want them to go there,
00:43:08.260 | you're going to waste a lot of effort.
00:43:09.860 | Um, and so when possible,
00:43:12.020 | having dev and test from the same distribution,
00:43:14.420 | which it isn't always, uh,
00:43:15.780 | there, there, there's some caverns,
00:43:16.900 | but when it's feasible to do so, um,
00:43:19.060 | this really improves the, the, the, the, um, the team's efficiency.
00:43:23.860 | Um, and another thing is once you specify the dev set,
00:43:27.620 | that's like your problem specification, right?
00:43:30.100 | Once you specify the test set,
00:43:31.500 | that's your problem specification.
00:43:33.100 | Your team might go and collect more training data or change the training set,
00:43:36.700 | or synthesize more training set.
00:43:38.220 | But, but, you know, you shouldn't change the test set if
00:43:40.980 | the test set is, is, is your problem specification, right?
00:43:44.060 | So, um, so in practice,
00:43:46.420 | what I actually recommend is splitting your training set as follows.
00:43:50.140 | Um, your training set,
00:43:51.380 | carve off a small part of this,
00:43:53.060 | let me just say 20 hours of data to form,
00:43:55.660 | I'm going to call this the, um,
00:43:57.820 | training dev set, train-dev set.
00:44:02.100 | It's basically a development set that's from
00:44:03.940 | the same distribution as your training set,
00:44:06.180 | uh, and then you have your dev set and your test set, right?
00:44:08.660 | So these are what you actually,
00:44:09.940 | from the distribution you actually care about.
00:44:11.700 | And these, you have a training set,
00:44:13.420 | 50,000 hours of all sorts of data,
00:44:15.260 | and maybe we aren't even entirely sure what data this is.
00:44:18.260 | But split off just a small part of this.
00:44:20.180 | So I guess this is now, what,
00:44:22.140 | 49980 hours and 20 hours.
00:44:26.020 | Um, and then, here's the generalization of the bias-variance concept.
00:44:32.060 | Um, actually, let me use this board.
00:44:35.500 | And, and I have to say,
00:44:41.220 | the, the, um, the fact that training and test sets don't match is one of
00:44:46.300 | the problems that, um, academia doesn't study much.
00:44:49.500 | There's some work on domain adaptation.
00:44:51.060 | There is some literature on it.
00:44:52.260 | But it turns out that when you train and test on different distributions,
00:44:55.700 | you know, it, it sometimes it's just random.
00:44:57.740 | It's a little bit luck whether you generalize well to a totally different test set.
00:45:01.580 | So that's made it hard to study systematically,
00:45:04.460 | which is why I think, um,
00:45:05.780 | academia has not studied this particular problem as much as I feel it is important
00:45:11.180 | to those of us building production systems.
00:45:13.620 | But there is some work, but, but no,
00:45:16.020 | no very widely deployed solutions yet, would be my sense.
00:45:19.580 | Um, but so I think our best practice is if,
00:45:21.900 | if you now generalize what I was describing just now to the following,
00:45:25.060 | which is, um, measure human level performance,
00:45:29.060 | measure your training set performance,
00:45:32.100 | measure your train-dev performance,
00:45:37.820 | measure your dev set performance,
00:45:40.340 | and measure your test set performance, right?
00:45:42.420 | So now you have kind of five numbers.
00:45:44.700 | So to take an example,
00:45:46.500 | let's say human level is 1% error.
00:45:49.060 | Um, and I'm, I'm gonna use very obvious examples for illustration.
00:45:52.900 | If your training set error is 10%,
00:45:55.340 | you know, and your train-dev error is 10.1%, right?
00:45:58.020 | Your dev error 10.1%, your test error, you know, 10.2%, right?
00:46:02.420 | In this example, then it's quite clear that you have a huge gap between
00:46:05.780 | human level performance and training set performance and so you have a huge bias, right?
00:46:11.740 | And, and so you'd kind of use the,
00:46:13.780 | the bias-fixing types of, um, solutions.
00:46:18.140 | Um, and then, um, there's just one example I wanna, well.
00:46:24.460 | And so I find that in machine learning,
00:46:27.100 | one of the most useful things is to look at the aggregate error of your system,
00:46:32.180 | which in this case, you know,
00:46:33.540 | is your dev set or your test set error.
00:46:35.260 | And then to break down the components to,
00:46:37.180 | to figure out how much of the error comes from where,
00:46:40.060 | so you know where to focus your attention.
00:46:42.020 | So this accumulation of errors, this difference here,
00:46:45.580 | this is maybe 9% bias, which is a lot.
00:46:47.620 | So I would work on the bias reduction techniques.
00:46:50.460 | Uh, this gap here, right?
00:46:52.260 | This is kind of, um, really the variance.
00:46:55.660 | This gap here is due to your train test distribution mismatch.
00:47:05.260 | Um, and this is overfitting of the dev set.
00:47:09.260 | Okay. Um, so just to be really concrete, um,
00:47:18.620 | here's an example where you have high train test error mismatch, right?
00:47:25.100 | Which is if human level performance is 1%,
00:47:27.380 | your training error is, you know, 2%.
00:47:30.260 | Uh, your train-dev error is 2.1%.
00:47:33.340 | And then on your depth set,
00:47:35.140 | uh, the error suddenly jumps to 10%, right?
00:47:37.820 | So this would- sorry, my, my,
00:47:39.460 | my, my x-axis doesn't perfectly line up.
00:47:42.960 | But if there's a huge gap here,
00:47:44.740 | then I would say you have a huge train test set mismatch problem.
00:47:48.460 | Okay. Um, and so at this basic level of analysis,
00:47:52.140 | what, you know, this formula for machine learning,
00:47:56.900 | instead of dev,
00:47:58.980 | I will replace this with train-dev, right?
00:48:05.060 | And then in the rest of this, uh,
00:48:08.820 | really recipe for machine learning, um,
00:48:12.060 | I would then ask, um,
00:48:15.140 | is your dev error high?
00:48:17.740 | If yes, then you have a train test mismatch problem.
00:48:23.740 | And there the solution would be to try to get more data,
00:48:27.500 | uh, that's similar to test set, right?
00:48:34.060 | Or maybe a data synthesis or data augmentation.
00:48:38.700 | You know, try to tweak your training set to make it look more like your test set.
00:48:43.260 | Um, and then there's always this kind of, uh, uh,
00:48:46.140 | a Hail Mary, I guess,
00:48:47.500 | which is, you know, a new architecture, right?
00:48:50.260 | [LAUGHTER]
00:48:54.380 | Um, and then finally, just to finish this up, you know,
00:48:56.220 | there's not that much more.
00:48:57.700 | Finally, uh, there's this, yeah.
00:49:00.620 | And then hopefully, if you're done, uh,
00:49:02.820 | hopefully your test set error will be, will be good.
00:49:05.740 | And if, if you're doing well on your dev set but not your test set,
00:49:08.940 | it means you've overfit your dev set,
00:49:10.380 | so just get some more dev set data, right?
00:49:12.540 | Actually, I'll just write this, I guess.
00:49:14.500 | Test set error high, right?
00:49:19.940 | And if yes, then just get more dev set data.
00:49:23.220 | Okay. And then done.
00:49:29.700 | Sorry, this is not too legible.
00:49:31.940 | What I wrote here is, uh,
00:49:34.300 | if your dev set error is not high but your test set error is high,
00:49:39.540 | it means you've overfit your dev set,
00:49:41.260 | so just get more, uh,
00:49:42.780 | get more dev set data, okay?
00:49:45.780 | Um, so one of the, um,
00:49:51.100 | effects I've seen with bias and variance is,
00:49:53.940 | it sounds so simple but it's actually much more difficult to apply in
00:49:58.500 | practice than it sounds when I talk about it like this, right?
00:50:02.060 | So some tips.
00:50:03.300 | For a lot of problems,
00:50:04.540 | just calculate these numbers and this can help drive your analysis in terms of deciding what to do.
00:50:10.420 | Um, yeah.
00:50:14.020 | And, and I find that it takes surprisingly long to really grok,
00:50:19.100 | to really understand bias and variance deeply.
00:50:21.700 | But I find that people that understand bias and variance deeply are often
00:50:25.620 | able to drive very rapid progress in,
00:50:27.900 | in, in machine learning applications, right?
00:50:30.380 | And, and I know it's much sexier to show you some cool new network architecture,
00:50:35.380 | but, I don't know,
00:50:36.980 | this, this really helps our teams make rapid progress on things.
00:50:41.060 | Um, so, you know,
00:50:46.940 | there's one thing I, I, I kind of snuck in here without making it explicit,
00:50:51.540 | which is that in this whole analysis,
00:50:54.580 | we were benchmarking against human level performance, right?
00:50:58.940 | So there's another trend,
00:51:01.020 | another thing that, that, that has been different.
00:51:04.020 | Uh, again, you know, I'm,
00:51:05.300 | I'm looking across a lot of projects I've seen in many areas and
00:51:07.860 | trying to pull out the common trends but I find that comparing to
00:51:11.300 | human level performance is a much more common theme now than several years ago, right?
00:51:15.900 | With, with I guess Andre being the,
00:51:17.500 | the, the human level benchmark for ImageNet.
00:51:20.220 | Um, and, and, and really at Baidu we compare our speech systems
00:51:23.460 | to human level performance and try to exceed it and so on.
00:51:25.900 | So why is that?
00:51:27.300 | Um, it turns out that,
00:51:29.420 | so why, why, why is human level performance, right?
00:51:32.620 | Such a, such a common theme in, in applied deep learning.
00:51:37.660 | Um, it turns out that if, um,
00:51:40.980 | this, the x-axis is time as in,
00:51:43.660 | you know, how long you've been working on a project.
00:51:45.620 | And the y-axis is accuracy, right?
00:51:48.300 | If this is human level performance,
00:51:51.060 | you know, like human level accuracy or human level performance on some task,
00:51:56.340 | you'll find that for a lot of projects,
00:51:58.260 | your teams will make rapid progress, you know,
00:52:02.060 | up until they get to human level performance.
00:52:05.580 | And then often it will maybe surpass human level performance a bit,
00:52:09.420 | and then progress often gets much harder after that, right?
00:52:12.860 | But this is a common pattern I see in a lot of problems.
00:52:16.220 | Um, so there are multiple reasons why this is the case.
00:52:18.980 | I'm, I'm curious, like why,
00:52:20.100 | why, why, why do you think this is the case?
00:52:22.420 | Any, any guesses? Yeah.
00:52:23.660 | [inaudible]
00:52:25.220 | Cool. Labels are coming from humans.
00:52:26.540 | [inaudible]
00:52:28.620 | Oh, cool. Yep. Labels are coming from humans. Anything else?
00:52:30.700 | [LAUGHTER]
00:52:33.220 | All right. Cool. Anything else?
00:52:34.620 | [inaudible]
00:52:37.980 | Oh, interesting. The architecture is modeled after the human brain.
00:52:39.820 | Yeah, I don't know. Maybe.
00:52:40.900 | I, I think that the, the,
00:52:42.060 | the distance from neural nets to human brains is very far.
00:52:44.660 | So that one I would, uh.
00:52:46.740 | I think that the human capacity,
00:52:49.820 | uh, to deal with these kind of problems is very similar.
00:52:53.420 | I see. Yeah. Human capacity to deal with these problems is similar.
00:52:55.900 | Yeah, kind of. Yeah.
00:52:57.100 | Oh, of course. Yeah.
00:52:58.060 | Just-
00:52:59.020 | You said bored.
00:53:00.300 | Oh, get bored. I see.
00:53:01.700 | [LAUGHTER]
00:53:02.500 | So you're saying, all right, cool.
00:53:04.180 | There's one more and then I'll just-
00:53:05.860 | [inaudible]
00:53:07.980 | Oh, be satisfied. Okay, cool.
00:53:09.260 | Be satisfied and bored, I guess,
00:53:10.820 | on two sides of the coin, I guess.
00:53:12.300 | [LAUGHTER]
00:53:14.620 | All right. So-
00:53:15.380 | [inaudible]
00:53:18.100 | Oh, it depends on your vision. Yeah, yeah.
00:53:19.540 | Cool. All right. So, so let me, let me, let me, uh.
00:53:21.940 | I think there are, there are, uh,
00:53:23.180 | all, all, all, you know, lots of great answers.
00:53:25.540 | Um, I think that there, there,
00:53:26.580 | there are several good reasons for this type of effect.
00:53:29.540 | Um, one of them is that, um,
00:53:33.060 | there is, for a lot of problems,
00:53:35.260 | there is some theoretical limit of performance, right?
00:53:39.180 | If, if, you know,
00:53:40.340 | some fraction of the data is just noisy.
00:53:42.660 | In speech recognition, a lot of audio clips are just noisy.
00:53:45.580 | Someone picked up a phone and, you know,
00:53:48.020 | they're in a rock concert or something and it's just
00:53:49.820 | impossible to figure out what on earth they were saying, right?
00:53:52.580 | Or some images, you know, are just so blurry.
00:53:54.620 | It's just impossible to figure out what this is.
00:53:57.140 | So there is some upper limit,
00:53:59.980 | theoretical limits of performance, um,
00:54:02.540 | called the optimal error rate, right?
00:54:06.500 | And, and the Bayesians will,
00:54:08.700 | will, will call this the Bayes rate, right?
00:54:10.820 | But really, there is some theoretical optimum where even if you had
00:54:14.620 | the best possible function, you know,
00:54:16.540 | with best possible parameters, it cannot do better than that because
00:54:19.660 | the input is just noisy and sometimes impossible to label.
00:54:22.980 | So it turns out that, um,
00:54:25.220 | humans are pretty good at a lot of the tasks we do, not all,
00:54:29.420 | but humans are actually pretty good at speech recognition,
00:54:31.340 | pretty good at computer vision.
00:54:32.740 | And so, you know, by the time you surpass human level accuracy,
00:54:36.340 | there might not be a lot of room, right, to go, to go further up.
00:54:39.780 | So that's kind of one reason,
00:54:41.060 | that's just humans are pretty good.
00:54:42.540 | Um, other reasons, I think a couple of people said, right?
00:54:45.420 | Um, and, and it turns out that, um,
00:54:47.500 | so long as you're still worse than humans, uh,
00:54:51.220 | you have better levers to make progress, right?
00:54:57.900 | Um, so, you know, while,
00:54:59.980 | while worse than humans,
00:55:05.820 | um, right, have good ways,
00:55:09.260 | uh, to make progress.
00:55:14.020 | And so some of those ways are, right?
00:55:17.060 | A couple of you mentioned this,
00:55:18.540 | you can get labels from humans, right?
00:55:27.380 | Um, you can also carry out error analysis.
00:55:32.660 | And error analysis just means look at your depth set,
00:55:35.940 | look at the examples your algorithm got wrong,
00:55:37.700 | and see, you know, see if the humans have any insight into why a human thought this is a cat,
00:55:42.220 | but your algorithm thought it was a dog,
00:55:43.460 | or why a human, you know,
00:55:44.900 | recognize this utterance correctly,
00:55:46.860 | but your system just, uh, mistranscribed this.
00:55:49.820 | Um, and then I think another reason is that it's easier to estimate,
00:55:58.220 | um, bias-variance effects.
00:56:01.260 | Right? And here's what I mean.
00:56:06.820 | Um, so let's see.
00:56:10.100 | To take another computing example,
00:56:12.220 | let's say that, uh, you're- you- let's say that you're working on some image recognition task.
00:56:17.660 | Right? If I tell you that, um, uh,
00:56:23.500 | your training error is 8%,
00:56:38.860 | um, and your dev error is 10%, right?
00:56:38.860 | Well, should you work on, you know,
00:56:41.260 | bias reduction techniques or should you work on variance reduction techniques?
00:56:44.620 | It's actually very unclear, right?
00:56:47.340 | If I tell you that humans get 7.5%,
00:56:53.500 | then you're pretty close on the training set to
00:56:56.620 | human and you would think you have more of a variance problem.
00:56:59.340 | If I tell you humans can get 1%,
00:57:03.380 | 1% error, then you know that even on the training set,
00:57:07.620 | you're doing way worse than humans.
00:57:09.260 | And so, well, you should build a bigger network or something, right?
00:57:12.500 | So this piece of information about where humans are,
00:57:15.340 | and- and I think of humans as a proxy,
00:57:17.460 | as an approximation for the Bayes error rate, for the optimal error rate.
00:57:20.500 | This piece of information really tells you where you should focus your effort,
00:57:23.660 | and therefore increases the efficiency of your team.
00:57:26.300 | But once you surpass human level performance, I mean,
00:57:28.900 | if- if- if even humans, you know,
00:57:31.060 | got, um, a 30% error, right?
00:57:34.300 | Then- then it's- it's- it's just slightly tougher.
00:57:36.940 | So that's just another thing that- that- that becomes harder to do,
00:57:40.100 | that you no longer have a proxy for estimating
00:57:42.980 | the Bayes error rate to decide how to improve performance, right?
00:57:47.260 | Um, so, you know, there are definitely lots of problems where we
00:57:49.980 | surpass human level performance and keep getting better and better.
00:57:53.100 | But I find that, uh, uh, uh,
00:57:55.300 | a lot of the- I find that my life building deep learning applications is
00:57:59.180 | often easier until we surpass human level performance,
00:58:01.940 | because we just have much better tools.
00:58:03.220 | And after we surpass human level performance, um, well,
00:58:06.380 | actually if you want the details,
00:58:07.340 | what we usually try to do is try to find
00:58:09.660 | subsets of data where we still do worse than humans.
00:58:12.340 | So find- let's say- so for example,
00:58:14.100 | right now we surpass human level performance for speech accuracy,
00:58:16.980 | uh, for short audio clips taken out of context.
00:58:19.460 | But if we find, for example,
00:58:20.580 | we're still way worse than humans on one particular type of accented speech.
00:58:24.780 | Then even if we are much better than humans in the aggregate,
00:58:27.420 | if we find we're much worse than humans on the subset of data,
00:58:30.020 | then all these levers still apply.
00:58:31.860 | But remember this is kind of an advanced topic maybe,
00:58:33.900 | where- where you segment the training set and analyze sub-
00:58:36.900 | sub- separate subsets of the training set.
00:58:39.020 | Yeah.
00:58:40.020 | If you think there's a tool that can take a human error rate to 30% out of 1%,
00:58:44.620 | maybe we can build that tool.
00:58:46.060 | Yeah.
00:58:46.940 | Can you do that?
00:58:47.700 | I see. Actually, you know, that's a wonderful question.
00:58:50.020 | I want to ask a related quiz question to everyone in the audience.
00:58:52.940 | I'm gonna come back to- to- to what Alex just said.
00:58:55.300 | All right. So, um, given everything we just said,
00:58:59.380 | um, I have another quiz for you.
00:59:01.900 | All right. Um, I'm gonna pose a question, uh,
00:59:04.940 | write down four choices and then ask you to raise your hand to- to- to,
00:59:08.620 | to vote what you think is the right answer.
00:59:10.540 | Okay. So, um, I talked about, you know,
00:59:14.260 | how the concept of human level accuracy is useful for driving machine learning progress.
00:59:19.780 | Right. So, um, how do you define human level performance?
00:59:25.140 | Right. So here's a concrete example.
00:59:26.940 | I'm spending a lot of time working on AI in healthcare.
00:59:29.060 | So a lot of medical examples in my head right now.
00:59:31.060 | But let's say that you want to do medical imaging for medical diagnosis.
00:59:34.380 | You know, so read medical images,
00:59:35.860 | tell your patient a certain disease or not.
00:59:37.900 | Right. So, um, so medical example.
00:59:42.420 | So my question to you is how do you define human level performance?
00:59:47.540 | Um, choice A is, um,
00:59:50.660 | you know, a typical human, so a non-doctor.
00:59:53.700 | Right. Let's say that the error rate at reading a certain type of medical image is 3%,
00:59:58.820 | right. Choice B is a typical doctor.
01:00:06.980 | Let's say a typical doctor makes 1% error.
01:00:11.700 | Um, or I can find an expert doctor.
01:00:16.660 | And let's say an expert doctor makes 0.7% error.
01:00:21.780 | Or I can find a team of expert doctors.
01:00:28.740 | And what I mean is if I find a team of expert doctors and have a team look at
01:00:35.580 | every image and debate and discuss and have them come to,
01:00:38.580 | you know, the team's best guess of what's happening with this patient.
01:00:41.540 | Let's say I can get 0.5% error.
01:00:44.820 | So think for a few seconds.
01:00:46.060 | I'll ask you to vote by,
01:00:47.260 | by, by raising your hands.
01:00:48.580 | Which of these is the most useful definition of
01:00:52.420 | human level error if you want to use this to drive the performance of your algorithms?
01:00:56.380 | Okay. So who thinks choice A? Raise your hand.
01:01:01.140 | I have a question.
01:01:02.660 | Oh, sure.
01:01:03.260 | [inaudible]
01:01:06.180 | Uh, uh, uh, yeah.
01:01:07.580 | Uh, don't worry about ease of obtaining this data.
01:01:09.820 | Yeah. Right. So which is the most useful definition?
01:01:12.860 | Choice A, who? Anyone?
01:01:14.980 | Okay. Just a couple of people.
01:01:16.620 | Choice B, who thinks you use this?
01:01:18.580 | Cool. Like a fifth.
01:01:20.580 | Choice C, expert doctors?
01:01:22.820 | Another fifth. Choice D?
01:01:25.140 | Oh, cool. Wow. Interesting.
01:01:27.660 | All right. So, so I'll tell you that, um,
01:01:31.020 | I think that for the purpose of driving machine learning progress,
01:01:34.620 | I think ignoring the cost of collecting data was a great question.
01:01:38.420 | Um, I would find this definition the most useful, um,
01:01:42.620 | because I think that, um,
01:01:45.220 | a lot of what we're trying to use human level performance as a proxy for is the Bayes rate,
01:01:50.180 | is really the optimal error rate, right?
01:01:52.260 | And, and really to measure the baseline level of noise in your data.
01:01:55.740 | Um, and so, you know,
01:01:57.540 | if a team of human doctors can get 0.5%,
01:01:59.860 | then you know that the mathematically optimal error rate
01:02:02.460 | has got to be 0.5% or maybe even a little bit better.
01:02:05.580 | Um, and so for the purpose of using this number to drive all these decisions,
01:02:10.380 | such as, um, estimate bias and variance, right?
01:02:14.180 | Uh, uh, that definition gives you the best,
01:02:16.660 | you know, estimate of bias, right?
01:02:18.820 | Um, uh, because you know that the Bayes error rate is,
01:02:21.540 | is 0.5% or lower.
01:02:23.380 | Um, in practice, because of the cost of,
01:02:26.660 | you know, getting labels and so on,
01:02:28.260 | in practice, you know,
01:02:29.660 | I would fully expect teams to use this definition, uh, uh, and, and, and,
01:02:34.060 | and by the way, publishing papers is different than, um,
01:02:37.100 | the goal of publishing papers is different than the goal of actually,
01:02:39.300 | you know, building the best possible product, right?
01:02:40.820 | So for the purpose of publishing papers,
01:02:42.820 | people like to say, oh, we're better than the human level.
01:02:45.020 | So for that, I guess using this definition would be what many people would do.
01:02:48.420 | Um, uh, and, and, and if you're actually trying to collect data, you know,
01:02:52.820 | there'll be some tiering where, right,
01:02:54.900 | get a typical doctor to label the example.
01:02:56.860 | If they aren't sure, hire an expert doctor.
01:02:58.380 | If they're still unsure, then find, you know,
01:03:00.540 | so, so for the purpose of data collection,
01:03:02.140 | you, you know, other processes.
01:03:03.660 | But for the mathematical analysis,
01:03:05.380 | I would tend to use 0.5 as,
01:03:07.260 | as, as, as my definition for that number.
01:03:09.620 | Cool. Question in the back?
01:03:10.660 | [inaudible]
01:03:18.020 | Oh, is it possible that team of expert doctors does worse than a single doctor?
01:03:21.500 | I don't know. I, I, I had to ask the doctors in the audience.
01:03:24.380 | I, I, I know. [LAUGHTER]
01:03:27.860 | All right. Um, all right.
01:03:30.740 | Just, just, I have just two more pages and I'll wrap up.
01:03:32.780 | Um, so, you know,
01:03:34.380 | one of the reasons I think in the era of deep learning,
01:03:36.780 | we, uh, refer to human level performance much more frequently is because,
01:03:40.700 | um, for a lot of these tasks,
01:03:42.420 | we are approaching human level performance, right?
01:03:44.860 | So when computer vision accuracy, you know,
01:03:47.420 | when, when, I guess maybe to continue this example, right?
01:03:50.820 | Um, if, you know,
01:03:52.100 | when your training set accuracy in computer vision was, you know,
01:03:56.060 | 30% and your dev error was like 35%,
01:04:00.060 | then it didn't really matter if human level performance was 1% or 2% or 3%.
01:04:04.420 | It, it didn't affect your decision that much because you're
01:04:06.700 | just so clearly far, so far from Bayes, right?
01:04:09.660 | But now as really more and more deep learning systems are
01:04:13.060 | approaching human level performance on all these tasks,
01:04:15.420 | measuring human level performance, uh,
01:04:17.660 | actually gives you very useful information to,
01:04:19.700 | to, to drive decision-making.
01:04:21.220 | And so honestly for a lot of the teams I work with,
01:04:23.380 | when I meet with them,
01:04:24.380 | a very common piece of advice is,
01:04:26.060 | please go and figure out what is human level performance,
01:04:28.380 | and, and then spend some time to have humans label and get that number because that
01:04:32.180 | number is useful for, for, for, for driving some of these decisions.
01:04:36.780 | So, um, just two last things and then we'll finish.
01:04:41.700 | Um, you know, one question I get asked a lot is, um,
01:04:46.580 | what can AI do?
01:04:48.020 | Really, what can deep learning do, right?
01:04:50.460 | Um, and, and I guess maybe this is partially a company thing,
01:05:01.100 | you know, with the rise of AI,
01:05:02.900 | I feel like, um, uh,
01:05:05.340 | maybe this is, again, a company thing.
01:05:06.860 | Um, in Silicon Valley,
01:05:08.780 | we've developed pretty good workflows for designing
01:05:11.260 | products in the desktop era and in the mobile era, right?
01:05:15.100 | So with processes like draw a wireframe,
01:05:17.540 | the designer draws a wireframe,
01:05:19.060 | excuse me, the, the product manager draws a wireframe,
01:05:21.300 | the designer does the visual design or something or,
01:05:23.820 | you know, they work together and then the programmer implements it.
01:05:25.820 | So we have well-defined workflows for how to design,
01:05:29.340 | you know, typical apps like the Facebook app or the Snapchat app or whatever.
01:05:33.260 | We sort of know how to design.
01:05:34.780 | We have workflows established in companies to design stuff like that.
01:05:38.220 | In the era of AI, um,
01:05:40.980 | I feel like we don't have good processes yet for designing AI products.
01:05:45.740 | So for example, how should a product manager specify,
01:05:48.900 | you know, I don't know, a self-driving car,
01:05:50.780 | how do you specify the product definition?
01:05:52.940 | How does a product manager specify what level of accuracy is needed for my cat detector?
01:05:57.740 | Is that how, how, how?
01:05:58.780 | So today in Silicon Valley,
01:06:00.460 | with AI working better and better,
01:06:02.540 | um, I find us inventing new processes in order to design AI product, right?
01:06:07.700 | Processes that really didn't exist before.
01:06:09.580 | But one of the questions I often get asked,
01:06:11.620 | partially sometimes by product people,
01:06:13.260 | sometimes by business people is what can AI do?
01:06:15.060 | Because when a product manager is trying to design a new thing,
01:06:18.240 | you know, it's nice if you can help them know what they can design,
01:06:21.380 | and what they might design
01:06:22.540 | that there's no way we can build, right?
01:06:24.020 | So, so, so, so when I,
01:06:25.860 | so, so I'm only giving some rules of thumb that are far from perfect,
01:06:29.500 | but that I found useful for thinking about what AI can do.
01:06:32.980 | Oh, oh, before I tell you the rules I use, um,
01:06:35.940 | here's one of the rules of thumb that a product manager I know was using,
01:06:39.520 | which is he says, assume that AI can do absolutely anything, right?
01:06:44.420 | And, and, and, and this actually wasn't terrible.
01:06:47.660 | It actually led to some good results, but, but I wanna,
01:06:50.740 | uh, uh, but I wanna give you some, some,
01:06:52.820 | some more nuanced, uh, uh,
01:06:54.380 | ways of communicating about modern deep learning,
01:06:56.940 | um, in, in, in, in, in these sorts of organizations.
01:06:59.620 | You know, one is, um,
01:07:01.240 | anything that a person,
01:07:04.860 | a typical person can do in less than one second, right?
01:07:13.420 | And I know this rule is far from perfect,
01:07:15.100 | there are a lot of counter examples to this,
01:07:16.680 | but this is one of the rules I found useful,
01:07:18.340 | which is that if it's a task that a normal person can do with less than one second of thinking,
01:07:22.680 | there's a very good chance we could automate it with deep learning.
01:07:25.060 | So, you know, given a piece of, uh,
01:07:27.220 | given a picture, tell me if the face in this picture is smiling or frowning.
01:07:30.780 | You don't need to think for more than a second.
01:07:32.960 | So, yes, we can build deep learning systems and do that really well, right?
01:07:36.980 | Um, or speech recognition, you know,
01:07:38.520 | like, uh, uh, listen to this audio clip, what did they say?
01:07:41.060 | You don't need to think for that long, it's less than a second.
01:07:43.360 | So this is really a lot of the perception work, uh,
01:07:45.940 | uh, in computer vision, speech, um,
01:07:48.260 | that, uh, deep learning is working on.
01:07:50.580 | This rule of thumb works less well for NLP,
01:07:52.700 | I think because humans just take time to read text,
01:07:54.780 | but we have, um, right now at Baidu,
01:07:57.340 | a bunch of product managers looking around for tasks that humans can do in less than one second,
01:08:01.300 | uh, to try to automate them.
01:08:02.460 | So this has been highly flawed,
01:08:03.980 | but, but still useful rule of thumb. Um, there's a question?
01:08:07.060 | [inaudible]
01:08:17.200 | I see. Yeah.
01:08:18.840 | [inaudible]
01:08:21.080 | Yeah, actually great question.
01:08:22.080 | I feel like a lot of the value of deep learning,
01:08:24.000 | a lot, a lot of the, the, the concrete short-term applications,
01:08:27.720 | um, a lot of them have been, um,
01:08:29.240 | trying to automate things that people can do.
01:08:31.440 | Uh, really, especially people that do it in a very short time.
01:08:34.400 | And, and this feeds into all the advantages, you know,
01:08:37.980 | when, when you're trying to automate something that a human can already do.
01:08:41.100 | [inaudible]
01:08:48.580 | Oh, I see. Oh, that's an interesting observation.
01:08:49.980 | Oh, if a human can label in less than a second,
01:08:51.540 | you can get a lot of data. Yeah, that's an interesting observation.
01:08:53.620 | Cool. Yeah. Great.
01:08:56.300 | Um, and then I think another one,
01:08:58.060 | the, the other huge bucket of deep learning applications I've seen create tons of value,
01:09:02.260 | is, um, uh, predicting the outcome of the next event,
01:09:11.740 | uh, in a sequence of events.
01:09:14.980 | Right. Um, but so, you know,
01:09:18.460 | if there's something that happens over and over,
01:09:20.180 | such as, you know, well,
01:09:21.560 | maybe not super inspiring, we show a user an ad, right?
01:09:23.800 | That happens a lot. Uh, uh,
01:09:25.700 | and, and the user clicks on it or doesn't click on it, so we have
01:09:28.020 | tons of data to predict if the user will click on the next ad.
01:09:30.580 | Probably the most lucrative application of, uh,
01:09:32.620 | AI deep learning today.
01:09:33.880 | Or, you know, Baidu,
01:09:35.660 | we run a food delivery service.
01:09:37.060 | So we've seen a lot of data of,
01:09:38.660 | if you order food from this restaurant to go to this destination at this time of day,
01:09:42.060 | how long does it take? We've seen that a ton of times.
01:09:44.220 | Very good at predicting if you order food,
01:09:46.380 | how long it will take to, to send this food to you.
01:09:48.900 | So I feel like, I don't know,
01:09:50.500 | I, you know, deep learning does so much stuff.
01:09:52.420 | So I've struggled a bit to come up with simple rules to explain to,
01:09:55.700 | to, to really to product managers, right?
01:09:57.980 | How to design around it.
01:09:59.280 | I found these two rules useful even though I know these are
01:10:02.420 | clearly highly flawed and there are many, many counter examples, right?
01:10:06.540 | Um, so I think, let's see.
01:10:08.260 | Um, so it's, it's exciting to find for deep learning because I think it's
01:10:12.300 | letting us do a lot of interesting things.
01:10:14.340 | It's also causing us to rethink how we organize the companies.
01:10:17.260 | How I build a systems team next to an AI team,
01:10:18.900 | how we set up the workflow and process for, for our products.
01:10:22.620 | I think there's a lot of excitement going on.
01:10:24.900 | Um, the last thing I want to do is, um, you know,
01:10:28.220 | I found that the number one question I get asked is,
01:10:32.920 | um, uh, how do you build a career in machine learning, right?
01:10:38.920 | And I think, um, you know, when I, when I did a Reddit Ask Me Anything,
01:10:43.920 | a Reddit AMA, that was one of the questions that was asked.
01:10:46.680 | Even today, a few people came up to me and said, you know,
01:10:49.480 | they've taken a machine learning course,
01:10:50.640 | you know, the machine learning MOOC on Coursera or something else.
01:10:52.960 | Um, what advice do you have for building a career in machine learning?
01:10:56.560 | I have to admit, I, I don't have an amazing answer to that, but
01:10:59.460 | since I get asked that so often and because I really want to think what would be
01:11:04.020 | the most useful content to you, I, I, I thought I'll at least attempt an answer,
01:11:07.660 | even though it is maybe not a great one, right?
01:11:10.540 | So this is the last thing I have, uh,
01:11:12.620 | which is kind of personal advice.
01:11:14.540 | Um, you know, I think that, um,
01:11:17.340 | I was asking myself this same question, uh, uh, uh, uh, like a,
01:11:21.300 | a couple of months ago, right?
01:11:22.340 | Which is, you know, after you've taken a machine learning course, um,
01:11:26.260 | what's the next step for, um, developing your machine learning career?
01:11:31.200 | And at that time, I thought, um,
01:11:33.720 | the best thing would be if you attend deep learning school.
01:11:36.960 | [LAUGH] So, so, so, so, Sammy, Peter and I got together to do this, I hope.
01:11:41.720 | [LAUGH] Um, this is really part of motivation.
01:11:44.960 | Um, and then, and then beyond that, right, what,
01:11:46.880 | what are the things that, that, that really help?
01:11:49.160 | So, I have had, actually, I think all of our organizations have had, quite a lot
01:11:53.640 | of people who want to move from non-machine learning into machine learning.
01:11:57.460 | And when I look at the career paths, um, you know,
01:12:00.020 | one common thing is after taking these courses,
01:12:03.620 | to work on a project by yourself, right?
01:12:05.660 | I've seen, I have a lot of respect for Kaggle.
01:12:07.220 | A lot of people actually participate in Kaggle and learn from the blogs there and
01:12:10.460 | then, and then become better and better at it.
01:12:12.500 | Um, but I wanna share with you one other thing that I haven't really shared.
01:12:15.100 | Oh, by the way, almost everything I talk about today is,
01:12:17.180 | is, is new content that I've never presented before, right?
01:12:19.540 | So, so, so, I, I, so I hope it worked okay.
01:12:22.460 | [LAUGH] Thank you.
01:12:23.460 | [APPLAUSE]
01:12:28.680 | So, I want to share with you, really, the, the,
01:12:31.640 | one thing that is a PhD student process, right?
01:12:35.120 | Which is, you know, a lot of people, really,
01:12:39.920 | when I was teaching full time at Stanford, a lot of people would join Stanford and
01:12:43.040 | ask me, you know, how do I become a machine learning researcher?
01:12:46.160 | How do I have my own ideas on how to push the bleeding edge of machine learning?
01:12:50.480 | And whether, you know, you're working in robotics or machine learning or, or
01:12:54.980 | something else, right?
01:12:56.460 | There's one PhD student process that I find has been incredibly reliable.
01:13:00.660 | And and, and I'm gonna say it, and you may or may not trust it, but
01:13:06.500 | I've seen this work so reliably so many times that I hope you take my word for
01:13:10.580 | it, that this process reliably turns non-machine learning researchers into,
01:13:15.020 | you know, very good machine learning researchers, which is and
01:13:18.700 | there's no magic, really, read a lot of papers and work on replicating results.
01:13:24.040 | Right?
01:13:27.160 | And I think that the human brain is a remarkable device, you know?
01:13:32.080 | People often ask me, how do you have new ideas?
01:13:34.240 | And I find that if you read enough papers and replicate enough results,
01:13:38.360 | you will have new ideas on how to push forward the state of the art, right?
01:13:42.000 | I, I don't know how the, I don't really, I don't know how the human brain works, but
01:13:45.680 | I've seen this be an incredibly reliable process.
01:13:48.360 | If you read enough papers and, you know, between 20 and 50 papers later, and
01:13:52.700 | it's not one or two, it's more like 20 or maybe 50, you will start to have your own
01:13:56.060 | ideas, and this has been, so you see Sammy's nodding his head.
01:13:58.820 | This is an incredibly reliable process, right?
01:14:01.540 | And then my other piece of advice is so
01:14:05.540 | sometimes people ask me what working in AI is like.
01:14:09.060 | And I think some people have this picture that when we work on AI, you know,
01:14:13.300 | at Baidu or Google, OpenAI, whatever.
01:14:15.980 | I think some people have this picture of us hanging out in these airy,
01:14:20.520 | you know, well-lit rooms with natural plants in the background.
01:14:25.520 | And we're all standing in front of a whiteboard discussing the future of
01:14:29.240 | humanity, right?
01:14:30.720 | >> [LAUGH] >> And all of you know,
01:14:34.320 | working on AI is not like that.
01:14:36.320 | Frankly, almost all we do is dirty work, right?
01:14:39.400 | >> [LAUGH] >> So
01:14:42.840 | one place that I've seen people get tripped up is when they think working on
01:14:46.700 | AI is that future of humanity stuff, and shy away from the dirty work.
01:14:52.060 | And dirty work means anything from going on the Internet and downloading data and
01:14:57.100 | cleaning data, or downloading a piece of code and tuning parameters to see what
01:15:00.980 | happens, or debugging your stack trace to figure out why this silly thing overflowed.
01:15:06.020 | Or optimizing the database, or hacking a GPU kernel to make it faster, or
01:15:10.340 | reading a paper and struggling to replicate the result.
01:15:13.040 | At the end, a lot of what we do comes down to dirty work.
01:15:16.240 | And yes, there are moments of inspiration, but
01:15:19.080 | I've seen people really stall if they refuse to get into the dirty work.
01:15:23.720 | So my advice to you is, and actually another place I've seen people stall is
01:15:29.160 | if they only do dirty work, then you can become great at data cleaning, but
01:15:33.880 | also not become better and better at having your own moments of inspiration.
01:15:38.360 | So one of the most reliable formulas I've seen is really if you do both of these.
01:15:42.940 | Dig into the dirty work.
01:15:44.140 | If your team needs you to do some dirty work, just go and do it.
01:15:48.340 | But in parallel, read a lot of papers.
01:15:50.660 | And I think the combination of these two is the most reliable formula I've seen for
01:15:55.020 | producing great researchers.
01:15:56.380 | So I want to close with just one more story about this.
01:16:07.900 | And I guess some of you may have heard me talk about the Saturday story, right?
01:16:13.720 | But for those of you that want to advance your career in machine learning,
01:16:17.480 | next weekend you have a choice, right?
01:16:20.240 | Next weekend, you can either stay at home and watch TV, or you could do this, right?
01:16:28.200 | And it turns out this is much harder, and then no short term rewards for
01:16:31.200 | doing this, right?
01:16:32.080 | If next weekend, I think this weekend you guys are all doing great.
01:16:34.960 | >> [LAUGH] >> But next weekend,
01:16:39.720 | if you spend next weekend studying, reading papers, replicating results,
01:16:43.240 | there are no short term rewards.
01:16:44.400 | If you go to work the following Monday, your boss doesn't know what you did,
01:16:47.240 | your peers didn't know what you did.
01:16:48.760 | No one's gonna pat you on the back and say good job,
01:16:50.880 | you spent all weekend studying.
01:16:52.720 | And realistically, after working really, really hard next weekend,
01:16:55.960 | you're not actually that much better, you're barely any better at your job.
01:16:59.720 | So there's pretty much no reward for working really,
01:17:02.760 | really hard all of next weekend.
01:17:04.760 | But I think the secret to advancing your career is this.
01:17:08.920 | If you do this not just on one weekend, but do this a weekend after weekend for
01:17:13.040 | a year, you will become really good at this.
01:17:15.440 | In fact, everyone I've worked with at Stanford that I was close to and
01:17:20.560 | that became great at this, everyone, actually including me, was a grad student.
01:17:25.000 | We all spent late nights hunched over like a neural net tuning hyperparameters,
01:17:29.560 | trying to figure out why it wasn't working.
01:17:31.480 | And it was that process of doing this not just one weekend, but weekend after weekend,
01:17:35.720 | that allowed all of us, really, our brains'
01:17:40.200 | neural networks, to learn the patterns that taught us how to do this.
01:17:44.640 | So I hope that even after this weekend, you keep on spending the time to keep
01:17:49.120 | learning because I promise that if you do this for long enough,
01:17:51.680 | you will become really, really good at deep learning.
01:17:55.160 | So just to wrap up, I'm super excited about AI.
01:17:58.520 | And making this analogy that AI is the new electricity, right?
01:18:02.280 | And what I mean is that just as 100 years ago,
01:18:05.840 | electricity transformed industry after industry, right?
01:18:08.880 | Electricity transformed agriculture, manufacturing, transportation,
01:18:12.520 | communications.
01:18:14.240 | I feel like those of you that are familiar with AI are now in an amazing position
01:18:19.480 | to go out and transform not just one industry, but potentially a ton of industries.
01:18:24.360 | So I guess at Baidu, I have a fun job trying to transform not just one industry,
01:18:31.000 | but multiple industries.
01:18:32.560 | But I see that it's very rare in human history where one person,
01:18:40.720 | where someone like you can gain the skills and
01:18:44.000 | do the work to have such a huge impact on society.
01:18:47.720 | I think in Silicon Valley, the phrase change the world is overused, right?
01:18:51.760 | Every Stanford undergrad says I want to change the world.
01:18:54.240 | But for those of you that work in AI, I think that the path from what you do to
01:18:58.200 | actually having a big impact on a lot of people and helping a lot of people.
01:19:01.800 | In transportation, in healthcare, in logistics, in whatever,
01:19:05.000 | is actually becoming clearer and clearer.
01:19:07.360 | So I hope that all of you will keep working hard even after this weekend and
01:19:14.080 | go do a bunch of cool stuff for humanity.
01:19:16.640 | Thank you.
01:19:17.200 | >> [APPLAUSE]
01:19:27.200 | >> Thank you.
01:19:34.260 | >> [APPLAUSE]
01:19:39.240 | >> Do we make an announcement,
01:19:40.160 | Shivo? We're running super late, so
01:19:41.640 | I'll be around later if you want.
01:19:42.960 | Okay, so let's break for today and look forward to seeing everyone tomorrow.