Back to Index

Nuts and Bolts of Applying Deep Learning (Andrew Ng)


Transcript

>> So when we were organizing this workshop, my co-organizers initially asked me, hey Andrew, at the end of the first day, go give a visionary talk. So until several hours ago, my talk was advertised as a visionary talk. But as I was preparing this presentation over the last several days, I tried to think about what would be the most useful information to you, what are the things that you could take back to work on Monday and do something different at your job next Monday.

And to set context: as Peter mentioned, I lead Baidu's AI team. It's a team of about 1,000 people working on vision, speech, NLP, lots of applications and machine learning. And so what I thought I'd do instead is, instead of taking the shiniest pieces of deep learning that I know, I want to take the lessons that I saw at Baidu that are common to so many different academic areas as well as applications: autonomous cars, augmented reality, advertising, web search, medical diagnosis.

I'll take what I saw, the common lessons, the simple, powerful ideas that I've seen drive a lot of machine learning progress at Baidu, and share those ideas with you, because the patterns I see across a lot of projects might be the patterns that are most useful to you as well, whatever you're working on in the next several weeks or months.

So one common theme that will appear in this presentation today is that the workflow of organizing machine learning projects feels like parts of it are changing in the era of deep learning. So for example, one of the ideas I'll talk about is bias-variance. It's a super old idea, right?

And many of you, maybe all of you, have heard of bias-variance. But in the era of deep learning, I feel like there have been some changes to the way we think about bias and variance. So I wanna talk about some of these ideas, which maybe aren't even deep learning per se, but have been slowly shifting as we apply deep learning to more and more of our applications, okay?

And instead of holding all your questions until the end, if you have a question in the middle, feel free to raise your hand as well. I'm very happy to take questions in the middle, since this is a more maybe informal whiteboard talk, right? And also, say hi to our home viewers, hi.

>> So one question that I still get asked sometimes is, and Andrej alluded to this earlier, a lot of the basic ideas of deep learning have been around for decades. So why are they taking off just now, right? Why is it that deep learning, these neural networks we've all known about for maybe decades, why are they working so well now?

So I think that the one biggest trend in deep learning is scale, that scale drives deep learning progress. And I think Andrej mentioned scale of data and scale of computation, and I'll just draw a picture that illustrates that concept maybe a little bit more. So I plot a figure where on the horizontal axis I plot the amount of data we have for a problem, and on the vertical axis we plot performance.

Right, so the x-axis is the amount of spam data you've collected, the y-axis is how accurately you can classify spam. Then if you apply traditional learning algorithms, what we found was that the performance often starts to plateau after a while. It was as if the older generations of learning algorithms, like logistic regression and SVMs, didn't know what to do with all the data that we finally had.

And what happened over the last 20 years, last 10 years, with the rise of the Internet, the rise of mobile, the rise of IoT, is that society has sort of marched to the right of this curve, for many problems, not all problems. And so with all the buzz and all the hype about deep learning, in my opinion, the number one reason that deep learning algorithms work so well is that if you train, let me call it a small neural net, maybe you get slightly better performance.

If you train a medium-sized neural net, maybe you get even better performance. And it's only if you train a large neural net that you get a model with the capacity to absorb all this data that we now have access to, and that allows you to get the best possible performance.

And so I feel like this is a trend that we're seeing in many verticals in many application areas. A couple of comments. One is that this, actually when I draw this picture, some people ask me, well, does this mean a small neural net always dominates a traditional learning algorithm?

The answer is not really. Technically, if you look at the small data regime, if you look at the left end of this plot, right, the relative ordering of these algorithms is not that well defined. It depends on who's more motivated to engineer the features better, right? If the SVM guy is more motivated to spend more time engineering features, they might beat out the neural network application.

That's because when you don't have much data, a lot of the knowledge of the algorithm comes from hand engineering, right? But this trend is much more evident in the regime of big data, where you just can't hand engineer enough features, and the large neural net combined with a lot of data tends to outperform.

So a couple of comments. The implication of this figure is that in order to get the best performance, in order to hit that target, you need two things, right? One is you need to train a very large neural network, or reasonably large neural network, and you need a large amount of data.

And so this in turn has caused pressure to train large neural nets, right? Build large nets as well as get huge amounts of data. So one of the other interesting trends I've seen is that increasingly I'm finding that it makes sense to build an AI team as well as build a computer systems team and have the two teams kind of sit next to each other.

And the reason I say that is, let's see, when we started Baidu Research, we set up our team that way. Other teams are also organized this way. I think Peter mentioned to me that OpenAI also has a systems team and a machine learning team. And the reason we're starting to organize our teams that way, I think, has to do with some of the computer systems work we do, right?

So we have an HPC team, a high-performance computing, supercomputing team at Baidu. Some of the extremely specialized knowledge in HPC is just incredibly difficult for an AI researcher to learn, right? Some people are super smart. Maybe Jeff Dean is smart enough to learn everything. But it's just difficult for any one human to be sufficiently expert in HPC and sufficiently expert in machine learning.

And so we've been finding, and Shubho, one of the co-organizers, is actually on our HPC team, we've been finding that bringing knowledge from these multiple sources, multiple communities, allows us to get our best performance. You've heard a lot of fantastic presentations today. I want to draw one other picture, which is, in my mind, how I mentally bucket work in deep learning.

So this might be a useful categorization, right? When you look at a talk, you can mentally put it into one of these buckets I'm about to draw. I feel like there's a lot of work on what I'm gonna call general DL, general models. And this is basically the type of model that Hugo Larochelle talked about this morning, where you have, you know, really densely connected layers, right?

I guess FC, fully connected, right, so there's a huge bucket of models there. And then I think a second bucket is sequence models. So 1D sequences, and this is where I would bucket a lot of the work on RNNs, LSTMs, GRUs, some of the attention models, which I guess Yoshua Bengio will probably talk about tomorrow, or maybe others, maybe Quoc, I'm not sure, right?

But so the 1D sequence models is another huge bucket. And the third bucket is the image models. This is really 2D and maybe sometimes 3D, but this is where I would tend to bucket all the work of CNNs, convolutional nets. And then in my mental bucket, then there's a fourth one, which is the other, right?

And this includes unsupervised learning, reinforcement learning, as well as lots of other creative ideas being explored in the field. You know, I still find slow feature analysis, sparse coding, ICA, various models in the other category, super exciting. So it turns out that if you look across industry today, almost all the value today is driven by those first three buckets, right?

So what I mean is those three buckets of algorithms are driving, causing us to have much better products, right, or monetizing very well. It's just incredibly useful for lots of things. In some ways, I think this bucket might be the future of AI, right? So I find unsupervised learning especially super exciting.

So I'm actually super excited about this as well. Although I think that if on Monday you have a job and you're trying to build a product or whatever, the chance of you using something from one of these three buckets will be highest. But I definitely encourage you to contribute to research here as well, right?

So I said that trend one, the major trend one of deep learning, is scale. Here is what I would say is major trend two, of two trends; this list is not gonna go on forever, right? I feel major trend two is the rise of end-to-end deep learning, especially for rich outputs.

And so end-to-end deep learning, I'll say a little bit more in a second exactly what I mean by that. But the examples I'm gonna talk about are all from one of these three buckets, right, general DL, sequence models, image 2D, 3D models. But let's see, it's best illustrated with a few examples.

Until recently, a lot of machine learning used to output just real numbers. So I guess in Richard's example, you have a movie review, right? And then, actually, I prepared totally different examples; I was editing my examples earlier to be more coherent with the speakers before me. You have a movie review and then output the sentiment.

Is this a positive or a negative movie review? Or you might have an image, right? And then you want to do ImageNet object recognition. So this would be a 0, 1 output, or this might be an integer from 1 to 1,000. But until recently, a lot of machine learning was about outputting a single number, maybe a real number, maybe an integer.

And I think the number two major trend that I'm really excited about is end-to-end deep learning algorithms that can output much more complex things than numbers. And so one example that you've seen is image captioning, where instead of taking an image and saying this is a cat, you can now take an image and output an entire string of text, using an RNN to generate that sequence.

So I guess Andrej, who spoke just now, Oriol Vinyals, Shu Wei at Baidu, a whole bunch of people have worked on this problem. One of the things that my collaborator Adam Coates will talk about tomorrow, maybe Quoc as well, I'm not sure, is speech recognition, where you take as input audio and you directly output the text transcript, right?

And so when we first proposed using this kind of end-to-end architecture to do speech recognition, it was very controversial. We were building on the work of Alex Graves, but the idea of actually putting this in a production speech system was very, very controversial when we first said we wanted to do this.

But I think the whole community is coming around to this point of view more recently. Or, you know, machine translation, say going from English to French, right? So Ilya Sutskever, Quoc Le, and others are working on this, a lot of teams now. Or, given some parameters, synthesize a brand new image, right?

And you saw some examples of image synthesis. So I feel like the second major trend of deep learning that I find very exciting, and that is allowing us to build transformative things that we just couldn't build three or four years ago, is this trend toward learning algorithms that can output not just a number but very complicated things, like a sentence, or a caption, or a French sentence, or an image, or, like the recent WaveNet paper, audio, right?

So I think this is maybe the second major trend. Now, despite all the excitement about end-to-end deep learning, I think that end-to-end deep learning, sadly, is not the solution to everything. I wanna give you some rules of thumb for what exactly end-to-end deep learning is, and when to use it and when not to use it.

So I'm moving to the second bullet and we'll go through these. The trend towards end-to-end deep learning has been this idea that instead of engineering a lot of intermediate representations, maybe you can go directly from your raw input to whatever you wanna predict, right?

So for example, and I'm gonna use speech as a recurring example: for speech recognition, previously one used to go from the audio to hand-engineered features like MFCCs or something, and then maybe extract phonemes, right? And then eventually you try to generate the transcript.

For those of you that aren't sure what a phoneme is: if you listen to the word cat and the word kick, the k sound, right, is the same sound. And so phonemes, such as k, are these basic units of sound that are hypothesized by linguists to be the building blocks of speech.

So k, a, t would be maybe the three phonemes that make up the word cat, right? Traditional speech systems used to do this, and I think in 2011 Li Deng and Geoff Hinton made a lot of progress in speech recognition by saying we can use deep learning to do this first step.

But the end-to-end approach to this would be to say, let's forget about phonemes, let's just have a neural net input the audio and output the transcript. So it turns out that for some problems there's this end-to-end approach, where one end is the input and the other end is the output.

So the phrase end-to-end deep learning refers to just having a neural net, or really a learning algorithm, directly go from input to output. That's what end-to-end means. This end-to-end formula makes for great PR, and it's actually very simple, but it only works sometimes.

Um, and actually, maybe, maybe, yeah, I'll just tell you this interesting story. You know, this end-to-end story really upset a lot of people. Um, when we were doing this work, I guess, I used to go around and say, I think phonemes are a fantasy of linguists. Um, and we should do away with them.

And I still remember there was a meeting at Stanford, some of you know who it was, there was a linguist kind of yelling at me in public, uh, for saying that. So maybe, maybe I should not, uh, but we turned out to be right, you know, so. All right.

So let's see. But the Achilles heel of a lot of deep learning is that you need tons of labeled data, right? So if this is your x and that's your y, then for end-to-end deep learning to work, you need a ton of labeled input-output data, x, y.

So to take an example where one may or may not consider end-to-end deep learning: this is a problem I learned about just last week from Curtis Langlotz and Darwin, who's in the audience, I think. Imagine you want to use x-ray pictures of a child's hand in order to predict the child's age, right?

So this is a real thing, you know, doctors actually care to look at an x-ray of your, of a child's hand in order to predict the, the age of the child. So, um, boy, let me draw an x-ray image, right? So this is, you know, the child's hand. So these are the bones, right?

I guess. This is why I'm not a doctor. Okay. So that's a hand and, and, and you see the bones. Um, and so more traditional algorithm might input an image, and then first, you know, extract the bones. So first figure out, oh, there's a bone here, there's a bone here, there's a bone here, and then maybe measure the length of these bones, right?

So really I'm gonna say bone lengths, and then maybe have some formula, like some linear regression, an average, some simple thing, to go from the bone lengths to estimate the age of the child, right? So this is a non-end-to-end approach to solving this problem. An end-to-end approach would be to take an image, run a convnet or whatever, and just try to output the age of the child.

And I think this is one example of a problem where, um, it's very challenging to get end-to-end deep learning to work, because you just don't have enough data. You just don't have enough X-rays of children's hands annotated with their ages. And instead, where we see deep learning coming in is in this step, right?

To go from the image to figuring out where the bones are, use deep learning for that. But the advantage of this non-end-to-end architecture is that it allows you to hand engineer in more information about the system, such as how bone lengths map to age, right? Which you can kind of get tables about.
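
To make the contrast concrete, here's a minimal Python sketch of the non-end-to-end pipeline just described: a bone-measurement step (where deep learning would go) followed by a hand-engineered regression from bone lengths to age. The `detect_bone_lengths` function and the fitted table below are made-up placeholders so the sketch runs; this illustrates the architecture, not the actual system.

```python
# Non-end-to-end pipeline: image -> bone lengths -> age.
import numpy as np
from sklearn.linear_model import LinearRegression

def detect_bone_lengths(xray_image: np.ndarray) -> np.ndarray:
    # Hypothetical deep-learning step: locate bones and measure their lengths.
    # Faked here so the sketch runs end to end.
    return np.sort(xray_image.mean(axis=0))[:5]

# Hand-engineered final step: fit a simple regression on a small table of
# (bone lengths, age) pairs -- the kind of prior knowledge doctors can supply
# even when labeled X-rays are scarce. The table below is synthetic.
lengths_table = np.random.rand(100, 5) * 10           # fake bone lengths (cm)
ages_table = lengths_table.sum(axis=1) * 0.3 + 2      # fake ages (years)
age_from_lengths = LinearRegression().fit(lengths_table, ages_table)

def predict_age(xray_image: np.ndarray) -> float:
    lengths = detect_bone_lengths(xray_image).reshape(1, -1)
    return float(age_from_lengths.predict(lengths)[0])

print(predict_age(np.random.rand(64, 64)))
```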

There are a lot of examples like this. And I think one of the unfortunate things about deep learning is that, let's see, for suitably sexy values of X and Y, you could almost always train a model and publish a paper.

But that doesn't always mean that it's actually a good idea. Peter? Yes, that's true. So Peter's pointing out that in practice, if this is a fixed function F, you could backprop all the way from the age back to the image.

Yeah, that's a good idea actually. Um, who was it just now who said you better do it quickly? Yeah. Um, let me give a couple of other examples, uh, uh, that where- where it might be harder to backprop all the way through, right? So here's- here's an example. Um, take self-driving cars.

You know, most teams are using an architecture where you input an image, you know, what's in front of the car, let's say. And then you, you know, detect other cars, right? Uh, and then- and- and maybe use the image to detect pedestrians, right? Self-driving cars are obviously more complex than this, right?

Uh, but then now that you know where the other cars and where the pedestrians are relative to your car, you then have a planning algorithm, uh, uh, to then, you know, come up with a trajectory, right? And then now that you know, um, what's the trajectory that you want your car to drive through, um, you could then, you know, compute the steering direction, right?

Let's say. And so this is actually the architecture that most self-driving car teams are using. And there have been interesting approaches that say, well, I'm gonna input an image and output a steering direction, right? And I think this is an example of where, at least with today's state of technology, I'd be very cautious about this second approach, because, I think, if you have enough data, the second approach will work, and you can even prove a theorem showing that it will work, I think.

But I don't know that anyone today has enough data to make the second approach really, really work well, right? And Peter made a great comment just now. I think some of these components will be incredibly complicated; you know, this planner could be doing an explicit search.

And you could actually design a really complicated planner to generate the trajectory, and your ability to hand code that still has a lot of value, right? So this is one thing to watch out for. I have seen project teams say, I can get X, I can get Y, I'm gonna train deep learning.

But unless you actually have the data, some of these things make for great demos if you cherry-pick the examples, but it can be challenging to get them to work at scale. I should say, for self-driving cars, this debate is still open.

I'm cautious about this. I don't think this will necessarily fail; I just think the data needed to do this will be really immense. So I'd be very cautious about it right now. But it might work if you have enough data.

Um, so, you know, one of the themes that comes up in machine learning, or really if you're working on a machine learning project, one thing that will often come up is, um, you will, you know, develop a learning system, uh, train it, maybe it doesn't work as well as you're hoping yet.

And the question is, what do you do next, right? This is a very common part of a machine learning, you know, a researcher or a machine learning engineer's life, which is, you know, you- you- you train a model, doesn't do what you want it to yet, so what do you do next, right?

This happens to us all the time. Um, and you face a lot of choices. You could collect more data, maybe you want to train it longer, maybe you want a different neural network architecture, maybe you want to try regularization, maybe you want a bigger model, or run some more GPUs.

So you have a lot of decisions. And I think that a lot of the skill of a machine learning researcher or a machine learning engineer is knowing how to make these decisions, right? Whether you train a bigger model or try regularization, your skill at picking these decisions will have a huge impact on how rapidly you can make progress on an actual machine learning problem.

So I want to talk a little about bias and variance, since that's one of the most basic concepts in machine learning, and I feel like it's evolving slightly in the era of deep learning. As a motivating example, let's say the goal is to build a human-level speech recognition system, okay?

So what we would typically do, especially in academia, is get a dataset, here's my dataset with a lot of examples, and then shuffle it and randomly split it into a 70-30 train-test split, or maybe 70% train, 15% dev, and 15% test, right?

Oh, and some people use the term validation set, but I'm just gonna say dev set; it stands for development set and means the same thing as validation set. Okay, so that's pretty common. And what I encourage you to do, if you aren't already, is to measure the following things.

Human level error. Let me illustrate with an example. Let's say that on your dev set, human level error is 1%. Let's say that your training set error is, let me use 5%, and let's say that your dev set error, where the dev set is really a proxy for the test set except that you tune to the dev set, is 6% error.

Okay. So this is really a step in developing a learning algorithm that I encourage you to take if you aren't already: figure out what these three numbers are. Because these three numbers really help in terms of telling you what to do next.

So in this example, you see that you're doing much worse than human level performance; there's a huge gap here from 1% to 5%. And I'm gonna call this the bias of your learning algorithm. And for the statisticians in the room, I'm using the terms bias and variance informally; they don't correspond exactly to the way they're defined in textbooks.

But I find these useful concepts for deciding how to make progress on your problem. And so I would say that in this example, you have a high bias classifier. You'd try training a bigger model, maybe try training longer. We'll come back to this in a second.

For a different example: if human level error is 1%, training set error were 2%, and dev set error was 6%, then you really have a high variance problem, right, like an overfitting problem.

And this really tells you what to do, what to try, right? Try adding regularization, or try early stopping, or, even better, get more data, right? And then there's also really a third case, which is if you have 1% human level error,

5% training error, and 10% dev set error. And in this case, you have high bias and high variance, right? So I guess, yeah, high bias and high variance, you know, sucks for you, right?
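
As a rough illustration, here's a small Python sketch that takes those three numbers, human level error, training error, and dev set error, computes the bias and variance gaps, and suggests which bucket of fixes to reach for, matching the three cases just described. The 2% threshold is an arbitrary choice for illustration.

```python
# Diagnose bias vs. variance from (human, train, dev) error rates.
def diagnose(human_err, train_err, dev_err, tol=0.02):
    bias = train_err - human_err        # gap to human level ("avoidable bias")
    variance = dev_err - train_err      # gap between training and dev error
    actions = []
    if bias > tol:
        actions.append("high bias: bigger model, train longer, new architecture")
    if variance > tol:
        actions.append("high variance: more data, regularization, early stopping")
    return {"bias": bias, "variance": variance, "actions": actions or ["looks fine"]}

print(diagnose(0.01, 0.05, 0.06))   # high bias example: 4% bias, 1% variance
print(diagnose(0.01, 0.02, 0.06))   # high variance example
print(diagnose(0.01, 0.05, 0.10))   # high bias and high variance
```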

So I feel like when I talk to applied machine learning teams, there's one really simple workflow that is enough to help you make a lot of decisions about what you should be doing on your machine learning application. And if you're wondering why I'm talking about this and what it has to do with deep learning, I'll come back to that in a second, right?

Does this change in the era of deep learning? But I feel like there's almost a workflow, almost a flow chart, right? Which is: first ask yourself, is your training error high? Oh, and I hope I'm writing big enough that people can see it.

If you have trouble reading it, let me know and I'll read it back out, right? But first ask, are you even doing well on your training set? And if your training error is high, then you know you have high bias. And so your standard tactics are: train a bigger model, just train a bigger neural network, or maybe try training longer, make sure that your optimization algorithm is doing a good enough job.

Um, and then there's also this magical one which is a new model architecture, which is a hard one, right? Um, come back to that in a second, okay? And then you kind of keep doing that until you're doing well at least on your training set. Once you're at least doing well on your training set, so your training error is no longer high.

So no, training error is not unacceptably high. We then ask, is your dev error high, right? And if the answer is yes, well, if your dev set error is high, then you have a high variance problem, an overfitting problem. And so the solutions are: try to get more data, right, or add regularization, or try a new model architecture, right?

And you kind of keep doing this until, I guess, you're doing well both on your training set and on your dev set. And then, hopefully, right, you're done. So I think one of the nice things about this era of deep learning is that no matter where you're stuck, with modern deep learning tools we have a clear path for making progress, in a way that was not true, or at least was much less true, in the era before deep learning. In particular, no matter what your problem is, overfitting or underfitting, really high bias or high variance or maybe both, right?

You always have at least one action you can take, which is bigger model or more data. So in the deep learning era, relative to, say, the logistic regression era, the SVM era, it feels like we more often have a way out of whatever problem we're stuck in.

Um, and so I feel like these days, people talk less about bias-variance trade-off. You might have heard that term, bias-variance trade-off, underfitting versus overfitting. And the reason we talked a lot about that in the past was because a lot of the moves available to us like tuning regularization, that really traded off bias and variance.

So it was like a zero-sum thing, right? You could improve one, but that made the other one worse. But in the era of deep learning, really one of the reasons I think deep learning has been so powerful is that the coupling between bias and variance can be weaker.

And we now have better tools to reduce bias without increasing variance, or reduce variance without increasing bias. And really the big one is: you can always train a bigger model, a bigger neural network, in a way that was harder to do when you were training, say, logistic regression, where getting a bigger model meant coming up with more and more features, right?

So that was just harder to do. So let's see. I'm gonna add more to this diagram at the bottom in a second, okay? One of the effects of this, and by the way, I've been surprised, I mean, honestly, this new model architecture option, that's really hard, right?

It takes a lot of experience, but even if you aren't super experienced with a variety of deep learning models, the things in the blue boxes, you can often do those, and that will drive a lot of progress, right? But if you have experience with how to tune a convnet versus a resnet versus whatever, by all means, try those things as well.

Definitely keep mastering those as well. But this dumb formula of bigger model, more data is enough to do very well on a lot of problems. So, let's see. Oh, so bigger models put pressure on systems, which is why we have a high-performance computing team.

More data has led to another interesting set of investments. A lot of us have always had this insatiable hunger for data: we use crowdsourcing for labeling, we try to come up with all sorts of clever ways to get data.

One area that I'm seeing more and more activity in, it feels a little bit nascent, but I'm seeing a lot of activity in, is automatic data synthesis, right? And here's what I mean. You know, once upon a time, people used to hand engineer features, and there was a lot of skill in hand engineering features like SIFT or HOG or whatever to feed into an SVM.

Automatic data synthesis is this little area that is small but feels like it's growing, where some hand engineering is needed, but I'm seeing quite a lot of progress on multiple problems enabled by hand engineering synthetic data in order to feed the giant maw of your neural network.

All right. So let me best illustrate it with a couple of examples. One of the easy ones is OCR. So let's say you want to train an optical character recognition system, and actually I've been surprised at Baidu: this has tons of users.

This is one of my most useful APIs at Baidu, right? If you imagine firing up Microsoft Word, downloading a random picture off the Internet, choosing a random Microsoft Word font, choosing a random word from the English dictionary, typing that English word into Microsoft Word in the random font, and pasting it, with a transparent background, on top of the random image off the Internet, then you've just synthesized a training example for OCR, right?

And so this gives you access to essentially unlimited amounts of data. It turns out that the simple idea I just described won't work in its natural form. You actually need to do a lot of tuning to blur the synthesized text into the background, to make sure the color contrast matches your training distribution.
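
Here's a toy Python sketch of that OCR synthesis recipe, assuming Pillow is installed. The background is just random noise and the font is Pillow's default, so this only illustrates the idea; a real pipeline would vary fonts and sizes, blend the text carefully, and match color and contrast to the target distribution, which is exactly the tuning described here.

```python
# Synthesize (image, label) training pairs for OCR by pasting a random word
# onto a random background and lightly blurring it.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

WORDS = ["cat", "kick", "mirror", "baidu", "learning"]   # stand-in dictionary

def synthesize_ocr_example(width=200, height=60):
    # Random "photo" background (in practice: a random image off the internet).
    background = Image.fromarray(
        (np.random.rand(height, width, 3) * 255).astype("uint8"))
    word = random.choice(WORDS)
    draw = ImageDraw.Draw(background)
    # Paste the word at a random position; a real pipeline would also vary
    # font, size, rotation, and color to match the target distribution.
    draw.text((random.randint(0, width // 2), random.randint(0, height // 2)),
              word, font=ImageFont.load_default(), fill=(0, 0, 0))
    # Slight blur so the text doesn't look pasted on.
    image = background.filter(ImageFilter.GaussianBlur(radius=0.8))
    return image, word   # (input image, label) training pair

img, label = synthesize_ocr_example()
print(label, img.size)
```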

So we found in practice it can be a lot of work to fine-tune how you synthesize data. But I've seen in many verticals, and I'll give a few examples, that if you do that engineering work, and sadly it's painful engineering, you can actually get a lot of progress. Actually, Tao Wang, who was a student here at Stanford, the effect I saw was that he engineered this for months with very little progress, and then suddenly he got the parameters right, and he had huge amounts of data and was able to build one of the best OCR systems in the world at that time, right?

Other examples: speech recognition, right? One of the most powerful ideas for building an effective speech system is to take clean audio, you know, relatively noiseless audio, and take random background sounds and just synthesize what that person's voice would sound like in the presence of that background noise, right?

And this turns out to work remarkably well. So if you record a lot of car noise, what the inside of your car sounds like, and record a lot of clean audio of someone speaking in a quiet environment, um, the mathematical operation is actually addition, the superposition of sound, but you basically add the two waveforms together, and then you get an audio clip that sounds like that person talking in the car, and you feed this to your learning algorithm.
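
A minimal sketch of that additive-noise idea: scale the background noise to a chosen signal-to-noise ratio and superpose it on the clean waveform. The waveforms below are synthetic arrays standing in for recorded audio.

```python
# Augment clean speech by adding background noise at a target SNR.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(clean)]                       # crop noise to same length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mixture has the requested SNR, then superpose.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

sample_rate = 16000
clean_speech = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
car_noise = np.random.randn(sample_rate) * 0.1
noisy_speech = mix_at_snr(clean_speech, car_noise, snr_db=10)  # new training clip
print(noisy_speech.shape)
```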

And so this has a dramatic effect in terms of amplifying the training set for speech recognition, and we found it can have a huge effect on performance. And then also NLP. Here's one example, actually done by some Stanford students, which is using end-to-end deep learning to do grammar correction.

So input an ungrammatical English sentence, maybe written by a non-native speaker, right? And can you automatically have, I guess, an attention RNN input the ungrammatical sentence and correct the grammar, just edit the sentence for me? And it turns out that you can synthesize huge amounts of this type of data automatically.
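
Here's a toy sketch of how such pairs might be synthesized: take clean sentences and corrupt them with simple rule-based errors, so the corrupted sentence becomes the model input and the original becomes the target. The corruption rules below are simplistic placeholders, not the actual system.

```python
# Generate (ungrammatical, corrected) sentence pairs by corrupting clean text.
import random

CORRUPTIONS = [
    lambda w: "" if w in {"the", "a", "an"} else w,                  # drop articles
    lambda w: {"is": "are", "are": "is", "has": "have"}.get(w, w),   # agreement errors
    lambda w: w + "s" if w in {"go", "want", "need"} else w,         # verb-form errors
]

def corrupt(sentence: str, p: float = 0.3) -> str:
    words = []
    for w in sentence.split():
        if random.random() < p:
            w = random.choice(CORRUPTIONS)(w)
        if w:
            words.append(w)
    return " ".join(words)

clean = "the cat is sitting on the mat"
pair = (corrupt(clean), clean)     # (model input, target output)
print(pair)
```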

And so that's another example where data synthesis works very well. And then, oh, I think, video games and RL, right? Really, well, let me say games broadly: one of the most powerful applications of RL, deep RL, these days is video games.

And I think if you think supervised learning has an insatiable hunger for data, wait till you work with RL algorithms; the hunger for data is even greater. But when you play video games, the advantage is that you can synthesize almost infinite amounts of data to feed this even greater need,

this even greater need that RL algorithms have. So just one note of caution: data synthesis has a lot of limits. I'll tell you one other story. Let's say you wanna recognize cars, right? There are a lot of video games; I need to play more video games.

What's a video game with cars in it? Oh, GTA, Grand Theft Auto, right? So there's a bunch of cars in Grand Theft Auto. Why don't we just take pictures of cars from Grand Theft Auto, and you can synthesize lots of cars, lots of orientations there, and paste that, give that as training data.

Um, it turns out that's difficult to do because from the human perceptual system, there might be 20 cars in a game, but it looks great to you because you can't tell if there are 20 cars in the game or 1,000 cars in the game, right? And so there are situations where the synthetic dataset looks great to you, because 20 cars in a video game is plenty, it turns out.

You don't need 100 different cars for a human to think it looks realistic. But from the perspective of the learning algorithm, this is a very impoverished, very poor dataset. So that's something to watch out for with data synthesis. For those of you that work in companies, one practice I would strongly recommend is to have a unified data warehouse, right?

So what I mean is that if your engineering teams or research teams have to go around trying to accumulate data from lots of different organizations in your company, that's just going to be a pain, it's going to be slow. So at Baidu, our policy is: it's not your data, it's the company's data, and if it's user data, it goes into the user data warehouse.

We should have a discussion about user access rights, privacy, and who can access what data. But at Baidu, I felt very strongly about this, so we mandate that the data needs to come into one logical warehouse, right? It's physically distributed across lots of data centers, but it should be in one system, and what we should discuss is access rights, but what we should not discuss is whether or not to bring data together into as unified a data warehouse as possible.

And so this is another practice that I've found makes access to data just much smoother and allows teams to drive performance. So really, if your boss asks you, tell them that I said to build a unified data warehouse, right? So, I wanna take the train-test, you know, bias-variance picture and refine it.

It turns out that this idea of a 70/30 split, right? Trained test or whatever, this was common in, um, machine learning kind of in the past when, you know, frankly, most of us in academia were working on relatively small data sets, right? And so, I don't know, there used to be this thing called the UC Irvine repository for machine learning data sets.

It was an amazing resource at the time, but by today's standards, it's quite small. And so you'd download the data set, shuffle it, and split it into train, dev, test, and whatever. In production machine learning today, it's much more common for your train and your test distributions to come from different distributions, right?

And this creates new problems and new ways of thinking about bias and variance. So let me talk about that. Here's a concrete example, and this is a real example from Baidu, right? We built a very effective speech recognition system. And then recently, actually quite some time back now, we wanted to launch a new product that uses speech recognition.

We wanted a speech-enabled rear view mirror, right? So if you have a car that doesn't have a built-in GPS unit, we wanted, and this is a real product in China, to let you take out your rear view mirror and put in a new AI-powered, speech-powered rear view mirror, because it's an easy, aftermarket installation.

So you can speak to your rear view mirror and say, "Dear rear view mirror, navigate me to wherever," right? So this is a real product. So how do you build a speech recognition system for this in-car speech-enabled rear view mirror? This was our status, right?

We have, let's call it, 50,000 hours of speech recognition data from all sorts of places, right? A lot of data we bought, some user data that we have permission to use, but a lot of data collected from all sorts of places, not your in-car rear view mirror scenario, right?

And then our product managers can go around and, through quite a lot of work, for this example let's say they collect 10 more hours of data from exactly the rear view mirror scenario, right? So, you know, install this thing in a car, have a driver drive it around and talk to it, and collect 10 hours of data from exactly the distribution that you want to test on.

So the question is, what do you do now, right? Do you throw these 50,000 hours of data away because they're not from the distribution you want, or can you use them in some way? In the older, pre-deep-learning days, people used to build very separate models. It was more common to build one speech model for the rear view mirror, one model for maps voice queries, one model for search, one model for that.

And in the era of deep learning, it's becoming more and more common to just pile all the data into one model and let the model sort it out. And so long as your model is big enough, you can usually do this. And if you get the features right, you can usually pile all the data into one model and often see gains, and certainly usually not see any losses.

But the question is, given this dataset, you know, how do you split this into trained dev tests, right? So here's one thing you could do, which is call this your training set, call this your dev set, and call this your test set, right? Um, turns out this is a bad idea.

I would not do this. One of the best practices that I would really push for is: make sure your development set and test set are from the same distribution. Right. I've been finding that this is one of the tips that really boosts the effectiveness of a machine learning team.

So in particular, I would make this the training set, and then of my 10 hours, well, let me expand this a little bit, right, a much smaller dataset: maybe five hours of dev, five hours of test. And the reason for this is that your team will be working to tune things on the dev set, right?

And the last thing you want is for them to spend three months working against the dev set and then realize, when they finally test, that the test set is totally different and a lot of work is wasted. So I think, to make an analogy, having different dev and test set distributions is a bit like if I tell you, "Hey, everyone, let's go north," right?

And then a few hours later, when all of you are in Oakland, I say, "Where are you? Wait, I wanted you to be in San Francisco." And you go, "What? Why did you tell me to go north? Tell me to go to San Francisco." Right. And so I think having dev and test sets be from the same distribution is one of the ideas that I've found really optimizes the team's efficiency, because the development set, which is what your team is going to be tuning algorithms to, is really the problem specification, right?

And if your problem specification tells them to go here, but you actually want them to go there, you're going to waste a lot of effort. And so, when possible, having dev and test from the same distribution, which isn't always possible, there are some caveats, but when it's feasible to do so, this really improves the team's efficiency.

And another thing: once you specify the dev set and the test set, that's like your problem specification, right? Your team might go and collect more training data, or change the training set, or synthesize more training data. But you shouldn't keep changing the test set if the test set is your problem specification, right?

So in practice, what I actually recommend is splitting your training set as follows. From your training set, carve off a small part, let me just say 20 hours of data, to form what I'm going to call the train-dev set. It's basically a development set that's from the same distribution as your training set. And then you have your dev set and your test set, right?

So those are from the distribution you actually care about. And then you have a training set, 50,000 hours of all sorts of data, and maybe we aren't even entirely sure what data this is. But split off just a small part of it. So I guess this is now 49,980 hours and 20 hours.
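
As a concrete sketch of this split, assuming the data are just lists of examples: the 50,000 hours of general data become the training set minus a small carved-out train-dev set, while the 10 hours of in-car data are split evenly into dev and test.

```python
# Build train / train-dev / dev / test splits when train and test
# distributions differ.
import random

def make_splits(general_data, target_data, train_dev_size, seed=0):
    rng = random.Random(seed)
    general, target = general_data[:], target_data[:]
    rng.shuffle(general)
    rng.shuffle(target)
    train_dev = general[:train_dev_size]          # same distribution as train
    train = general[train_dev_size:]
    half = len(target) // 2
    dev, test = target[:half], target[half:]      # both from the target distribution
    return train, train_dev, dev, test

# e.g. 50,000 "hours" of general data, 10 "hours" of in-car data
splits = make_splits(list(range(50_000)), list(range(10)), train_dev_size=20)
print([len(s) for s in splits])    # [49980, 20, 5, 5]
```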

And then, here's the generalization of the bias-variance concept. Actually, let me use this board. And I have to say, the fact that training and test sets don't match is one of the problems that academia doesn't study much. There's some work on domain adaptation.

There is some literature on it. But it turns out that when you train and test on different distributions, sometimes it's just random, a little bit of luck, whether you generalize well to a totally different test set. So that's made it hard to study systematically, which is why I think academia has not studied this particular problem as much as I feel it is important

to those of us building production systems. But there is some work, though no very widely deployed solutions yet, would be my sense. So I think our best practice is, if you now generalize what I was describing just now, the following: measure human level performance, measure your training set performance, measure your train-dev performance, measure your dev set performance, and measure your test set performance, right?

So now you have five numbers. To take an example, let's say human level is 1% error, and I'm gonna use very obvious examples for illustration. If your training set error is 10%, your train-dev error is 10.1%, your dev error is 10.1%, and your test error is 10.2%, right? In this example, it's quite clear that you have a huge gap between human level performance and training set performance, and so you have a huge bias, right?

And so you'd use the bias-fixing types of solutions. And I find that in machine learning, one of the most useful things is to look at the aggregate error of your system, which in this case is your dev set or your test set error.

And then to break it down into components, to figure out how much comes from where, so you know where to focus your attention. So in this accumulation of errors, this difference here is maybe 9% bias, which is a lot. So I would work on the bias reduction techniques.

This gap here, right, this is really the variance. This gap here is due to your train/test distribution mismatch. And this is overfitting of the dev set. Okay. So just to be really concrete, here's an example where you have high train/test mismatch, right?

Which is: if human level performance is 1%, your training error is 2%, your train-dev error is 2.1%, and then on your dev set the error suddenly jumps to 10%, right? Sorry, my x-axis doesn't perfectly line up. But if there's a huge gap here, then I would say you have a huge train/test mismatch problem.
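
Putting the five numbers together, here's a small sketch of the decomposition: each adjacent gap is attributed to bias, variance, train/test mismatch, or dev-set overfitting, using the two examples just described.

```python
# Decompose aggregate error into bias, variance, mismatch, and dev overfitting.
def decompose(human, train, train_dev, dev, test):
    return {
        "bias": train - human,                   # gap to human level (proxy for Bayes)
        "variance": train_dev - train,           # generalization within the train distribution
        "train_test_mismatch": dev - train_dev,  # cost of the distribution shift
        "dev_overfitting": test - dev,           # how much you've tuned to the dev set
    }

print(decompose(0.01, 0.10, 0.101, 0.101, 0.102))  # mostly a bias problem
print(decompose(0.01, 0.02, 0.021, 0.10, 0.10))    # mostly train/test mismatch
```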

Okay. And so at this basic level of analysis, in this formula for machine learning, instead of dev, I will replace this with train-dev, right? And then in the rest of this recipe for machine learning, I would then ask, is your dev error high?

If yes, then you have a train/test mismatch problem. And there the solution would be to try to get more data that's similar to your test set, right? Or maybe data synthesis or data augmentation. You know, try to tweak your training set to make it look more like your test set.

And then there's always this kind of Hail Mary, I guess, which is a new architecture, right? And then finally, just to finish this up, there's not that much more. And then hopefully, when you're done, your test set error will be good.

And if you're doing well on your dev set but not your test set, it means you've overfit your dev set, so just get some more dev set data, right? Actually, I'll just write this, I guess: test set error high? And if yes, then just get more dev data.

Okay. And then done. Sorry, this is not too legible. What I wrote here is: if your dev set error is not high but your test set error is high, it means you've overfit your dev set, so just get more dev set data, okay?

So one of the effects I've seen is that bias and variance sounds so simple, but it's actually much more difficult to apply in practice than it sounds when I talk about it like this, right? So some tips. For a lot of problems, just calculate these numbers, and this can help drive your analysis in terms of deciding what to do.

Yeah. And I find that it takes surprisingly long to really grok, to really understand, bias and variance deeply. But I find that people that understand bias and variance deeply are often able to drive very rapid progress in machine learning applications, right? And I know it's much sexier to show you some cool new network architecture, but this really helps our teams make rapid progress on things.

So there's one thing I kind of snuck in here without making it explicit, which is that in this whole analysis, we were benchmarking against human level performance, right? So there's another trend, another thing that has been different. Again, I'm looking across a lot of projects I've seen in many areas and trying to pull out the common trends, and I find that comparing to human level performance is a much more common theme now than several years ago, right?

With, I guess, Andrej being the human level benchmark for ImageNet. And really, at Baidu we compare our speech systems to human level performance and try to exceed it and so on. So why is that? Why is human level performance such a common theme in applied deep learning?

It turns out that if the x-axis is time, as in how long you've been working on a project, and the y-axis is accuracy, and this is human level performance, you know, human level accuracy on some task, you'll find that for a lot of projects, your teams will make rapid progress up until they get to human level performance.

And then often it will maybe surpass human level performance a bit, and then progress often gets much harder after that, right? This is a common pattern I see in a lot of problems. So there are multiple reasons why this is the case. I'm curious, why do you think this is the case?

Any guesses? Yeah. Cool. Labels are coming from humans. Oh, cool. Yep. Labels are coming from humans. Anything else? All right. Cool. Anything else? Oh, interesting. It's modeled after the human brain. Yeah, I don't know. Maybe. I think that the distance from neural nets to human brains is very far.

So that one I would... I think that the human capacity to deal with these kinds of problems is very similar. I see. Yeah. Human capacity to deal with these problems is similar. Yeah, kind of. Oh, of course. You said bored. Oh, get bored. I see.

So you're saying, all right, cool. There's one more and then I'll just- Oh, be satisfied. Okay, cool. Be satisfied and bored, I guess, two sides of the same coin. All right. Oh, it depends on your vision. Yeah, yeah. Cool. All right.

So let me see. I think there are lots of great answers. I think there are several good reasons for this type of effect. One of them is that, for a lot of problems, there is some theoretical limit on performance, right?

If, if, you know, some fraction of the data is just noisy. In speech recognition, a lot of audio clips are just noisy. Someone picked up a phone and, you know, they're in a rock concert or something and it's just impossible to figure out what on earth they were saying, right?

Or some images, you know, are just so blurry. It's just impossible to figure out what this is. So there is some upper limit, theoretical limits of performance, um, called the optimal error rate, right? And, and the Bayesians will, will, will call this the Bayes rate, right? But really, there is some theoretical optimum where even if you had the best possible function, you know, with best possible parameters, it cannot do better than that because the input is just noisy and sometimes impossible to label.

So it turns out that, um, humans are pretty good at a lot of the tasks we do, not all, but humans are actually pretty good at speech recognition, pretty good at computer vision. And so, you know, by the time you surpass human level accuracy, there might not be a lot of room, right, to go, to go further up.

So that's kind of one reason: humans are pretty good. Other reasons, I think a couple of people said, right? It turns out that, so long as you're still worse than humans, you have better levers to make progress, right? So, while you're worse than humans, you have good ways to make progress.

And so some of those ways are, right, a couple of you mentioned this: you can get labels from humans, right? You can also carry out error analysis. And error analysis just means look at your dev set, look at the examples you got wrong, and see if humans have any insight into why a human thought this was a cat but your system thought it was a dog, or why a human recognized this utterance correctly but your system mistranscribed it.

And then I think another reason is that it's easier to estimate bias-variance effects. And here's what I mean. To take another concrete example, let's say that you're working on some image recognition task, right? If I tell you that your training error is 8% and your dev error is 10%, right?

Well, should you work on, you know, bias reduction techniques or should you work on variance reduction techniques? It's actually very unclear, right? If I tell you that humans get 7.5%, then you're pretty close on the training set to human and you would think you have more of a variance problem.

If I tell you humans can get 1%, 1% error, then you know that even on the training set, you're doing way worse than humans. And so, well, you should build a bigger network or something, right? So this piece of information about where humans are, and- and I think of humans as a proxy, as an approximation for the Bayes error rate, for the optimal error rate.

This piece of information really tells you where you should focus your effort, and therefore increases the efficiency of your team. But once you surpass human level performance, I mean, if even humans got, say, a 30% error, right? Then it's just tougher.

So that's just another thing that becomes harder to do: you no longer have a proxy for estimating the Bayes error rate to decide how to improve performance, right? So, you know, there are definitely lots of problems where we surpass human level performance and keep getting better and better.

But I find that my life building deep learning applications is often easier until we surpass human level performance, because we have much better tools. And after we surpass human level performance, well, actually, if you want the details, what we usually try to do is find subsets of data where we still do worse than humans.

So for example, right now we surpass human level performance for speech accuracy on short audio clips taken out of context. But if we find, for example, that we're still way worse than humans on one particular type of accented speech, then, even if we are much better than humans in the aggregate, if we find we're much worse than humans on that subset of data, all these levers still apply.

But remember, this is kind of an advanced topic maybe, where you segment the training set and analyze separate subsets of the training set. Yeah. >> If you think there's a tool that can take a human error rate of 30% down to 1%, maybe we can build that tool.

Yeah. Can you do that? I see. Actually, you know, that's a wonderful question. I want to ask a related quiz question to everyone in the audience, and I'm gonna come back to what Alex just said. All right. So, given everything we just said, I have another quiz for you.

All right. I'm going to pose a question, write down four choices, and then ask you to raise your hand to vote for what you think is the right answer. Okay. So, I talked about how the concept of human-level accuracy is useful for driving machine learning progress.

So, how do you define human-level performance? Here's a concrete example. I'm spending a lot of time working on AI in healthcare, so a lot of medical examples are in my head right now. Let's say that you want to do medical imaging for medical diagnosis.

So, read medical images and tell whether a patient has a certain disease or not. A medical example. My question to you is: how do you define human-level performance? Choice A is a typical human, so a non-doctor; let's say their error rate at reading a certain type of medical image is 3%.

Choice B is a typical doctor; let's say a typical doctor makes 1% error. Choice C is an expert doctor; let's say an expert doctor makes 0.7% error. And choice D is a team of expert doctors: I have a team look at every image, debate and discuss, and come to the team's best guess of what's happening with this patient.

Let's say that team can get 0.5% error. So think for a few seconds; I'll ask you to vote by raising your hands. Which of these is the most useful definition of human-level error if you want to use it to drive the performance of your algorithms? Okay. So who thinks choice A?

Raise your hand. >> I have a question. >> Oh, sure. Yeah, don't worry about the ease of obtaining this data. Right. So which is the most useful definition? Choice A, who? Anyone? Okay, just a couple of people. Choice B, who thinks you'd use this? Cool, like a fifth.

Choice C, expert doctor? Another fifth. Choice D? Oh, cool. Wow, interesting. All right. So I'll tell you that, for the purpose of driving machine learning progress, and ignoring the cost of collecting data, which was a great question, I would find choice D the most useful definition, because a lot of what we're trying to use human-level performance as a proxy for is the Bayes rate, really the optimal error rate,

and really to measure the baseline level of noise in your data. And so, if a team of human doctors can get 0.5%, then you know that the mathematically optimal error rate has got to be 0.5% or maybe even a little bit better. So for the purpose of using this number to drive all these decisions, such as estimating bias and variance,

that definition gives you the best estimate of bias, because you know that the Bayes error rate is 0.5% or lower. In practice, because of the cost of getting labels and so on, I would fully expect teams to use this definition. And by the way, the goal of publishing papers is different than the goal of actually building the best possible product, right?

So for the purpose of publishing papers, people like to say, oh, we're better than human level, and for that, I guess using this definition is what many people would do. And if you're actually trying to collect data, there will be some tiering: get a typical doctor to label the example;

if they aren't sure, bring in an expert doctor; and if they're still unsure, escalate further. So for the purpose of data collection you'd use other processes. But for the mathematical analysis, I would tend to use 0.5% as my definition for that number. Cool. Question in the back?

>> Is it possible that a team of expert doctors does worse than a single doctor? >> I don't know; I'd have to ask the doctors in the audience. All right. I have just two more pages and then I'll wrap up. So, one of the reasons I think we refer to human-level performance much more in the era of deep learning is that, for a lot of these tasks, we are approaching human-level performance.

To continue this example: back when your training error in computer vision was, say, 30% and your dev error was 35%, it didn't really matter whether human-level performance was 1% or 2% or 3%.

It didn't affect your decision that much, because you were just so clearly far from Bayes. But now, as more and more deep learning systems approach human-level performance on all these tasks, measuring human-level performance actually gives you very useful information to drive decision-making.

And so honestly, for a lot of the teams I work with, a very common piece of advice when I meet with them is: please go and figure out what human-level performance is, and spend some time having humans label data to get that number, because that number is useful for driving some of these decisions.
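As a tiny illustration of why that number matters more as you approach it, reusing the avoidable-bias/variance split from before with hypothetical error rates:

```python
def split(train_error, dev_error, human_error):
    """Return (avoidable bias, variance) with human error as the Bayes proxy."""
    return train_error - human_error, dev_error - train_error

# Far from human level: the choice of human baseline barely changes the picture.
for human in (0.01, 0.02, 0.03):
    bias, var = split(0.30, 0.35, human)
    print(f"human={human:.0%}: avoidable bias={bias:.0%}, variance={var:.0%}")
# Avoidable bias is 27-29% in every case, so you would attack bias regardless.

# Close to human level: the same choice now flips the diagnosis.
for human in (0.005, 0.03):
    bias, var = split(0.032, 0.045, human)
    print(f"human={human:.1%}: avoidable bias={bias:.1%}, variance={var:.1%}")
# With human error at 0.5% the bias term dominates; at 3% the variance term does.
```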

So, just two last things and then we'll finish. One question I get asked a lot is: what can AI do? Really, what can deep learning do? And with the rise of AI, maybe this is partially a company thing.

In Silicon Valley, we've developed pretty good workflows for designing products in the desktop era and in the mobile era, with processes like: the product manager draws a wireframe, the designer does the visual design, or they work together, and then the programmer implements it.

So we have well-defined workflows for how to design typical apps like the Facebook app or the Snapchat app or whatever. We sort of know how to design those; we have workflows established in companies to design stuff like that. In the era of AI, I feel like we don't have good processes yet for designing AI products.

For example, how should a product manager specify, I don't know, a self-driving car? How do you write the product definition? How does a product manager specify what level of accuracy is needed for a cat detector? So today in Silicon Valley, with AI working better and better, I find us inventing new processes in order to design AI products.

Processes that really didn't exist before. But one of the questions I often get asked, sometimes by product people, sometimes by business people, is: what can AI do? Because when a product manager is trying to design a new thing, it's nice if you can help them know what they can design versus what there's no way we can build.

So I'm only giving some rules of thumb that are far from perfect, but that I've found useful for thinking about what AI can do. Oh, before I tell you the rules I use, here's a rule of thumb that a product manager I know was using, which is: assume that AI can do absolutely anything.

And this actually wasn't terrible; it actually led to some good results. But I want to give you some more nuanced ways of communicating about modern deep learning in these sorts of organizations. One is: anything that a typical person can do in less than one second.

And I know this rule is far from perfect, and there are a lot of counterexamples, but it's one of the rules I've found useful: if it's a task that a normal person can do with less than one second of thinking, there's a very good chance we could automate it with deep learning.

So, given a picture, tell me if the face in this picture is smiling or frowning. You don't need to think for more than a second, so yes, we can build deep learning systems that do that really well. Or speech recognition: listen to this audio clip, what did they say?

You don't need to think for that long; it's less than a second. So this covers a lot of the perception work in computer vision and speech that deep learning is working on. This rule of thumb works less well for NLP, I think because humans just take time to read text. But right now at Baidu, we have a bunch of product managers looking around for tasks that humans can do in less than one second, to try to automate them.

So this is a highly flawed but still useful rule of thumb. There's a question? I see. Yeah, actually, great question. I feel like a lot of the value of deep learning, a lot of the concrete short-term applications, has come from trying to automate things that people can do.

Really, especially things that people can do in a very short time. And this feeds into all the advantages you get when you're trying to automate something that a human can already do. Oh, I see, that's an interesting observation: if a human can label it in less than a second, you can get a lot of data.

Yeah, that's an interesting observation. Cool. Great. And then I think the other huge bucket of deep learning applications I've seen create tons of value is predicting the outcome of the next event in a sequence of events. So if there's something that happens over and over, such as, well, maybe not super inspiring, showing a user an ad, right?

That happens a lot, and the user clicks on it or doesn't click on it, so you have tons of data to predict whether a user will click on the next ad. That's probably the most lucrative application of AI and deep learning today. Or, at Baidu, we run a food delivery service.

So we've seen a lot of data on: if you order food from this restaurant going to this destination at this time of day, how long does it take? We've seen that a ton of times, so we're very good at predicting, if you order food, how long it will take to deliver it to you.
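As a minimal sketch of this second bucket, here is a toy next-event predictor; the feature names, the numbers, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not how any production ad system actually works:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical log of past ad impressions:
# [hour_of_day, user_past_click_rate, ad_relevance_score], label = clicked (1) or not (0).
X = np.array([
    [ 9, 0.02, 0.1],
    [12, 0.15, 0.8],
    [20, 0.05, 0.4],
    [23, 0.30, 0.9],
    # ... in practice, billions of logged impressions ...
])
y = np.array([0, 1, 0, 1])

# Because the same kind of event repeats over and over, a simple model can learn
# to predict the outcome of the next event: will this user click this ad?
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[18, 0.10, 0.7]])[:, 1])  # estimated click probability
```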

So I feel like deep learning does so much stuff that I've struggled a bit to come up with simple rules to explain to product managers how to design around it. I've found these two rules useful, even though I know they're clearly highly flawed and there are many, many counterexamples.

So, let's see. It's an exciting time for deep learning, because I think it's letting us do a lot of interesting things. It's also causing us to rethink how we organize our companies: how we build a systems team alongside an AI team, and the workflow and process for designing our products.

I think there's a lot of excitement going on. The last thing I want to do is this: I've found that the number one question I get asked is how to build a career in machine learning. When I did a Reddit Ask Me Anything, a Reddit AMA, that was one of the questions asked.

Even today, a few people came up to me and said, you know, I've taken a machine learning course, the machine learning MOOC on Coursera or something else; what advice do you have for building a career in machine learning? I have to admit I don't have an amazing answer to that, but since I get asked so often, and because I really wanted to think about what would be the most useful content for you, I thought I'd at least attempt an answer, even though it's maybe not a great one.

So this is the last thing I have, which is kind of personal advice. I was asking myself this same question a couple of months ago: after you've taken a machine learning course, what's the next step for developing your machine learning career?

And at that time, I thought the best thing would be if you could attend a deep learning school. So Sammy, Peter, and I got together to do this; that was really part of the motivation. And then beyond that, what are the things that really help?

I think all of our organizations have had quite a lot of people who want to move from non-machine-learning work into machine learning. And when I look at their career paths, one common thing, after taking these courses, is to work on a project by yourself.

I have a lot of respect for Kaggle. A lot of people participate in Kaggle, learn from the blogs there, and then become better and better at it. But I want to share with you one other thing that I haven't really shared before. Oh, by the way, almost everything I'm talking about today is new content that I've never presented before,

so I hope it worked okay. Thank you. I want to share with you one thing, which is the PhD student process. When I was teaching full time at Stanford, a lot of people would join Stanford and ask me, how do I become a machine learning researcher?

How do I have my own ideas for pushing the bleeding edge of machine learning? And whether you're working in robotics or machine learning or something else, there's one PhD student process that I've found to be incredibly reliable. I'm going to say it, and you may or may not trust it, but I've seen it work so reliably, so many times, that I hope you take my word for it: this process reliably turns non-machine-learning researchers into very good machine learning researchers. And there's no magic, really: read a lot of papers and work on replicating results.

And I think the human brain is a remarkable device. People often ask me, how do you have new ideas? I find that if you read enough papers and replicate enough results, you will have new ideas for how to push forward the state of the art.

I don't really know how the human brain works, but I've seen this be an incredibly reliable process. Somewhere between 20 and 50 papers later, and it's not one or two, it's more like 20 or maybe 50, you will start to have your own ideas. You can see Sammy nodding his head.

This is an incredibly reliable process. And then my other piece of advice: sometimes people ask me what working in AI is like. I think some people have this picture that when we work on AI, at Baidu or Google or OpenAI or wherever, we're hanging out in these airy, well-lit rooms with natural plants in the background.

And we're all standing in front of a whiteboard discussing the future of humanity, right? >> >> And all of you know, working on AI is not like that. Frankly, almost all we do is dirty work, right? >> >> So one place that I've seen people get tripped up is when they think working on AI is that future of humanity stuff, and shy away from the dirty work.

And dirty work means anything from going on the Internet and downloading data and cleaning data, or downloading a piece of code and tuning parameters to see what happens, or debugging your stack trace to figure out why this silly thing overflowed. Or optimizing the database, or hacking a GPU kernel to make it faster, or reading a paper and struggling to replicate the result.

In the end, a lot of what we do comes down to dirty work. And yes, there are moments of inspiration, but I've seen people really stall if they refuse to get into the dirty work. Another place I've seen people stall is if they only do the dirty work: you can become great at data cleaning, but never get better at having your own moments of inspiration.

So one of the most reliable formulas I've seen is really if you do both of these. Dig into the dirty work. If your team needs you to do some dirty work, just go and do it. But in parallel, read a lot of papers. And I think the combination of these two is the most reliable formula I've seen for producing great researchers.

So I want to close with just one more story about this. And I guess some of you may have heard me talk about the Saturday story, right? But for those of you that want to advance your career in machine learning, next weekend you have a choice, right? Next weekend, you can either stay at home and watch TV, or you could do this, right?

And it turns out this is much harder, and there are no short-term rewards for doing it. I think this weekend you guys are all doing great. >> >> But if you spend next weekend studying, reading papers, replicating results, there are no short-term rewards.

If you go to work the following Monday, your boss doesn't know what you did, your peers don't know what you did. No one's going to pat you on the back and say, good job, you spent all weekend studying. And realistically, after working really, really hard next weekend, you're barely any better at your job.

So there's pretty much no reward for working really, really hard all of next weekend. But I think the secret to advancing your career is this: if you do this not just on one weekend, but weekend after weekend for a year, you will become really good at this.

In fact, everyone I worked closely with at Stanford who became great at this, everyone, actually including me, was a grad student. We all spent late nights hunched over a neural net, tuning hyperparameters, trying to figure out why it wasn't working. And it was that process of doing this not just one weekend, but weekend after weekend, that allowed our own neural networks, our brains, to learn the patterns that taught us how to do this.

So I hope that even after this weekend, you keep spending the time to keep learning, because I promise that if you do this for long enough, you will become really, really good at deep learning. So just to wrap up: I'm super excited about AI, and I've been making this analogy that AI is the new electricity.

And what I mean is that just as 100 years ago, electricity transformed industry after industry, right? Electricity transformed agriculture, manufacturing, transportation, communications. I feel like those of you that are familiar with AI are now in an amazing position to go out and transform not just one industry, but potentially a ton of industries.

So I guess at Baidu, I have a fun job trying to transform not just one industry, but multiple industries. But it's very rare in human history that one person, someone like you, can gain the skills and do the work to have such a huge impact on society.

I think in Silicon Valley, the phrase "change the world" is overused; every Stanford undergrad says, I want to change the world. But for those of you who work in AI, I think the path from what you do to actually having a big impact on a lot of people, and helping a lot of people, in transportation, in healthcare, in logistics, in whatever, is becoming clearer and clearer.

So I hope that all of you will keep working hard even after this weekend and go do a bunch of cool stuff for humanity. Thank you. >> >> Thank you. >> >> Do we make an announcement, Shivo?

We're running super late, so I'll be around later if you want. Okay, so let's break for today and look forward to seeing everyone tomorrow.