3:34 Bypassing the curse of dimensionality We need to build compositionality into our ML models Just as human languages exploit compositionality to give representations and meanings to complex ideas
9:56 The need for distributed representations Clustering
12:56 Each feature can be discovered without the need for seeing the exponentially large number of configurations of the other features Consider a network whose hidden units discover the following features: - Person wears glasses
24:30 Exponential advantage of depth
35:15 Attention Mechanism for Deep Learning - Consider an input (or intermediate) sequence or image - Consider an upper level representation, which can choose
36:48 Attention Mechanisms for Memory Access Enable Reasoning
56:12 Learning Multiple Levels of Abstraction
58:11 Towards Key Principles of Learning for both Machines and Brains

00:00:04.320 | So I'll tell you about some very high level stuff today and
00:00:11.140 | no new algorithm.
00:00:12.120 | Some of you already know about the book that Ian Goodfellow,
00:00:17.880 | Aaron Courville and I have written.
00:00:19.800 | And it's now in pre-sale by MIT Press.
00:00:23.800 | I think you can find it on Amazon or something.
00:00:28.880 | And the paper version, the actual shipping, is gonna be in December, hopefully for NIPS.
00:00:34.600 | So we've already heard that story at least,
00:00:41.920 | well from several people here, at least from Andrew I think.
00:00:46.720 | But it's good to ponder a little bit some of these ingredients that seem to be
00:00:53.680 | important for deep learning to succeed.
00:00:57.720 | But in general for machine learning to succeed, to learn the really complicated tasks
00:01:02.080 | of the kind we want to solve to reach human-level performance.
00:01:07.760 | So if a machine is gonna be intelligent,
00:01:10.520 | it's going to need to acquire a lot of information about the world.
00:01:17.120 | And the big success of machine learning for AI has been to show that
00:01:22.680 | we can provide that information through data, through examples.
00:01:27.520 | But really think about it,
00:01:29.360 | that machine will need to know a huge amount of information about the world around us.
00:01:33.680 | This is not how we are doing it now because we're not able to train such big
00:01:37.400 | models, but it will come one day.
00:01:39.440 | And so we'll need models that are much bigger than the ones we currently have.
00:01:42.880 | Of course, that means machine learning algorithms that can represent
00:01:48.000 | complicated functions, that's one good thing about neural nets.
00:01:51.360 | But there are many other machine learning approaches that allow you in principle
00:01:55.440 | to represent very flexible forms like non-parametric methods,
00:02:00.200 | classical non-parametric methods or SVMs.
00:02:02.360 | But they're gonna be missing point four and
00:02:08.400 | potentially point five, depending on the method.
00:02:10.840 | Point three, of course: you need enough computing power to train and
00:02:16.920 | use these big models.
00:02:19.000 | And point five just says that it's not enough to be able to train the model,
00:02:24.280 | you have to be able to use it in a reasonably efficient way from a computational
00:02:28.840 | perspective.
00:02:29.840 | This is not always the case with some probabilistic models where inference,
00:02:34.080 | in other words, answering questions, having the computer do something,
00:02:37.920 | can be intractable and then you need to do some approximations,
00:02:41.440 | which could be efficient or not.
00:02:44.080 | Now, the point I really want to talk about is the fourth one,
00:02:47.160 | how do we defeat the curse of dimensionality?
00:02:51.440 | In other words, if you don't assume much about the world,
00:02:54.520 | it's actually impossible to learn about it.
00:02:58.520 | And so I'm gonna tell you a bit about the assumptions
00:03:06.120 | that are behind a lot of deep learning algorithms, which make it possible for them to
00:03:10.480 | work as well as we have been seeing in practice in the last few years.
00:03:30.800 | So how do we bypass the curse of dimensionality?
00:03:34.200 | The curse of dimensionality is about the exponentially large number of
00:03:39.920 | configurations of the space of variables that we want to model.
00:03:44.120 | The number of values that all of the variables that we observe can take
00:03:48.800 | is gonna be exponentially large in general because there's a compositional nature.
00:03:55.520 | If each pixel can take two values and you've got a million pixels,
00:03:58.360 | then you've got two to the one million possible images.
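To get a feel for the size of two to the one million, here is a small sketch that counts its decimal digits using logarithms rather than building the number. The helper name `decimal_digits_of_power_of_two` is made up for this illustration.

```python
import math

def decimal_digits_of_power_of_two(exponent: int) -> int:
    """Number of decimal digits of 2**exponent, computed via logarithms."""
    return math.floor(exponent * math.log10(2)) + 1

n_pixels = 1_000_000  # one binary value per pixel, as in the talk's example
digits = decimal_digits_of_power_of_two(n_pixels)
print(f"2**{n_pixels} has {digits} decimal digits")
```

For comparison, the number of atoms in the observable universe is usually estimated at around 80 decimal digits.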
00:04:01.240 | So the only way to beat an exponential is to use another exponential.
00:04:07.440 | So we need to make our models compositional.
00:04:11.800 | We need to build our models in such a way that they can represent
00:04:16.680 | functions that look very complicated.
00:04:18.680 | But yet, these models need to have a reasonably small number of parameters.
00:04:26.440 | Reasonably small in the sense that compared to the number of
00:04:30.240 | configurations of the variables, the number of parameters should be small.
00:04:34.080 | And we can achieve that by composing little pieces together,
00:04:40.720 | composing layers together, composing units on the same layer together.
00:04:45.600 | And that's essentially what's happening with deep learning.
00:04:48.000 | So you actually have two kinds of compositions.
00:04:50.840 | There's the compositions happening on the same layer.
00:04:54.080 | This is the idea of distributed representations,
00:04:57.480 | which I'm gonna try to explain a bit more.
00:05:00.120 | This is what you get when you learn embeddings for words or for images,
00:05:04.360 | representations in general.
00:05:06.400 | And then there's the idea of having multiple levels of representation.
00:05:10.080 | That's the notion of depth.
00:05:12.840 | And there, there is another kind of composition that takes place,
00:05:16.560 | whereas the first one is a kind of parallel composition.
00:05:19.720 | I can choose the values of my different units separately, and
00:05:23.080 | then they together represent an exponentially large number of possible
00:05:27.120 | configurations.
00:05:28.760 | In the second case, there's a sequential composition where I take the output of
00:05:32.240 | one level and I combine them in new ways to build features for
00:05:36.840 | the next level and so on and so on.
00:05:39.960 | So the reason deep learning is working
00:05:45.800 | is because the world around us is better modeled by making these assumptions.
00:05:53.040 | It's not necessarily true that deep learning is gonna work for
00:05:55.920 | any machine learning problem.
00:05:57.440 | In fact, if we consider the set of all possible distributions that we would like
00:06:01.000 | to learn from, deep learning is no better than any other method.
00:06:06.120 | And that's basically what the no free lunch theorem is saying.
00:06:09.880 | It's because we are incredibly lucky that we live in this world,
00:06:14.040 | which can be described by using composition, that these algorithms are working so well.
00:06:18.560 | It's important to really understand this.
00:06:23.160 | So before I go a bit more into distributed representations,
00:06:31.720 | let me say a few words about non-distributed representations.
00:06:34.000 | So if you're thinking about things like clustering, N-grams for
00:06:38.360 | language modeling, classical nearest neighbors, SVMs with Gaussian kernels,
00:06:45.120 | classical non-parametric models with local kernels, decision trees,
00:06:52.280 | all these things, the way these algorithms really work is actually pretty
00:06:57.960 | straightforward if you cut the crap and hide the math and
00:07:03.160 | try to understand what is going on.
00:07:05.040 | They look at the data in the data space and
00:07:10.360 | they break that space into regions.
00:07:12.960 | And they're gonna use different free parameters for
00:07:17.080 | each of those regions to figure out what the right answer should be.
00:07:20.240 | The right answer, it doesn't have to be supervised learning.
00:07:21.920 | Even in unsupervised learning, there's a right answer.
00:07:23.600 | It might be the density or something like that.
00:07:25.200 | Okay, and you might think that that's the only way of solving a problem.
00:07:31.360 | We consider all of the cases and we have an answer for each of the cases.
00:07:36.160 | And we can maybe interpolate between those cases that we've seen.
00:07:39.160 | The problem with this is somebody comes up with a new example which isn't
00:07:46.520 | in between two of the examples we've seen, something that requires us to extrapolate.
00:07:51.520 | Something that's a non-trivial generalization.
00:07:54.320 | And these algorithms just fail.
00:07:56.200 | They don't really have a recipe for
00:07:58.360 | saying something meaningful away from the training examples.
00:08:02.240 | There's another interesting thing to note here,
00:08:07.000 | which I would like you to keep in mind before I show the next slide,
00:08:10.400 | which is in red here, which is we can do a kind of simple counting
00:08:16.880 | to relate the number of parameters, the number of free parameters that can be
00:08:21.480 | learned, and the number of regions in the data space that we can distinguish.
00:08:28.440 | So here, we basically have a linear relationship between these two things.
00:08:33.440 | So for each region, I'm gonna need at least something like some kind of
00:08:38.960 | center for the region, and maybe if I need to output something,
00:08:42.120 | I'll need an extra set of parameters to tell me what the answer should be in that area.
00:08:48.040 | So the number of parameters grows linearly with the number of regions
00:08:51.640 | that I'm gonna be able to distinguish.
00:08:55.480 | The good news is I can have any kind of function, right?
00:08:58.320 | So I can break up the space in any way I want, and
00:09:00.520 | then for each of those regions, I can have any kind of output that I need.
00:09:03.800 | So for decision trees, the regions would be splitting across axes and so on, and
00:09:10.840 | for, this is more like for nearest neighbor or something like that.
00:09:14.560 | Now, what's going on?
00:09:20.920 | Ah, another bug.
00:09:47.960 | Okay, so here's the point of view of distributed
00:09:53.320 | representations for solving the same general machine learning problem.
00:10:00.520 | We have a data space and we wanna break it down, but
00:10:03.800 | we're gonna break it down in a way that's not general.
00:10:07.680 | We're gonna break it down in a way that makes assumptions about the data, but
00:10:12.840 | it's gonna be compositional and it's going to allow us to be exponentially more efficient.
00:10:17.800 | So how are we gonna do this?
00:10:19.360 | So in the picture on the right, what you see is a way to break
00:10:24.160 | the input space by the intersection of half planes.
00:10:27.960 | And this is the kind of thing you would have with what happens at the first layer
00:10:31.920 | of a neural net.
00:10:33.360 | So here, imagine the input is two dimensional, so I can plot it here, and
00:10:37.240 | I have three binary hidden units, C1, C2, C3.
00:10:41.960 | So because they're binary, you can think of them as little binary classifiers.
00:10:47.800 | And because it's only a one layer net,
00:10:50.880 | you can think of what they're doing as a linear classification.
00:10:55.040 | And so those colored hyperplanes here are the decision surfaces for each of them.
00:11:01.000 | Now, these three bits, they can take eight values, right,
00:11:06.320 | corresponding to whether each of them is on or off.
00:11:10.160 | And those different configurations of those bits correspond to
00:11:14.320 | actually seven regions here, because there's one of the eight regions which
00:11:19.920 | is not feasible.
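The seven-out-of-eight count can be checked with a small sketch. The three linear units below are hypothetical (the actual hyperplanes on the slide are not given in the transcript); any three lines in general position in 2-D behave the same way.

```python
from itertools import product

# Three hypothetical binary hidden units, each a linear classifier
# w.x + b > 0 on a 2-D input. Weights are illustrative, not from the slide.
UNITS = [
    ((1.0, 0.0), 0.0),   # C1 fires when x > 0
    ((0.0, 1.0), 0.0),   # C2 fires when y > 0
    ((1.0, 1.0), -1.0),  # C3 fires when x + y > 1
]

def code(point):
    """Return the binary activation pattern (C1, C2, C3) for a 2-D point."""
    x, y = point
    return tuple(int(w[0] * x + w[1] * y + b > 0) for w, b in UNITS)

# Sample a grid whose points do not fall on any of the three decision lines.
grid = [-2.8 + 0.5 * i for i in range(12)]
observed = {code((x, y)) for x, y in product(grid, grid)}
missing = set(product((0, 1), repeat=3)) - observed

print(len(observed), "feasible patterns; infeasible:", missing)
```

With these particular units, the pattern that never occurs is the one asking for x < 0, y < 0 and x + y > 1 at the same time.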
00:11:20.680 | So now you see that we are defining a number of regions which is corresponding
00:11:27.520 | to all of the possible intersections of the corresponding half planes.
00:11:31.160 | And now we can play the game of how many regions do we get for
00:11:37.920 | how many parameters?
00:11:39.240 | And what we see is that if we played the game of growing the number of
00:11:43.920 | dimensions, of features, and also of inputs, we can get an exponentially large
00:11:50.000 | number of regions, which are all of these intersections, right?
00:11:52.600 | There's an exponential number of these intersections
00:11:55.080 | corresponding to different binary configurations.
00:11:59.640 | Yet the number of parameters grows linearly with the number of units.
00:12:02.800 | So it looks like we're able to express a function.
00:12:07.040 | And then on top of that, I could imagine you have a linear classifier, right?
00:12:09.880 | That's the one hidden layer neural net.
00:12:13.360 | So the number of parameters grows just linearly with the number of features.
00:12:20.880 | But the number of regions that the network can really provide a different
00:12:23.960 | answer to grows exponentially.
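A rough way to play the counting game is to compare parameters with the maximum number of regions that hyperplanes in general position can cut out. The sketch below uses the classical hyperplane-arrangement formula; the function names are made up for this illustration.

```python
from math import comb

def max_regions(n_units: int, n_inputs: int) -> int:
    """Maximum number of regions cut out by n_units hyperplanes in general
    position in an n_inputs-dimensional space (classical counting formula)."""
    return sum(comb(n_units, j) for j in range(n_inputs + 1))

def n_parameters(n_units: int, n_inputs: int) -> int:
    """One weight per input plus one bias for every hidden unit."""
    return n_units * (n_inputs + 1)

for n_units, n_inputs in [(3, 2), (20, 10), (100, 50)]:
    print(n_units, n_inputs,
          n_parameters(n_units, n_inputs),
          max_regions(n_units, n_inputs))
```

When the number of units does not exceed the number of inputs, the formula gives exactly two to the number of units, while the parameter count stays a simple product. A method that needs its own parameters for each region, such as nearest neighbors, would need at least as many parameters as regions.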
00:12:25.960 | So this is very cool.
00:12:31.000 | And the reason it's very cool is that it allows those neural nets to generalize.
00:12:36.080 | Because while we're learning about each of those features,
00:12:40.760 | we can generalize to regions we've never seen because we've learned enough
00:12:47.840 | about each of those features separately.
00:12:50.360 | I'm going to give you an example of this in a couple of slides.
00:12:52.680 | Actually, let's do it first.
00:12:57.560 | So think about those features, let's say the input is an image of a person.
00:13:04.080 | And think of those features as things like,
00:13:07.160 | I have a detector that says that the person wears glasses.
00:13:11.240 | And I have another unit that's detecting that the person is a female or male.
00:13:16.800 | And I have another unit that detects that the person is a child or not.
00:13:20.080 | And you can imagine hundreds or thousands of these things, of course.
00:13:23.120 | So the good news is you could imagine learning about each of these
00:13:34.000 | feature detectors, these little classifiers, separately.
00:13:37.920 | In fact, you could do better than that.
00:13:40.200 | You could share intermediate layers between the input and those features.
00:13:44.520 | But let's take even the worst case and imagine we were to train those separately,
00:13:48.840 | which is the case in the linear model that I showed before.
00:13:51.800 | We have a separate set of parameters for each of these detectors.
00:13:55.600 | So if I have n features, each of them, say, needs order of k parameters.
00:14:01.680 | Then I need order of nk parameters, and I need order of nk examples.
00:14:07.240 | And one thing you should know from machine learning theory is that
00:14:11.480 | if you have order of p parameters,
00:14:18.880 | you need order of p examples to do a reasonable job of generalizing.
00:14:23.120 | You can get around that by regularizing and
00:14:26.960 | effectively having fewer degrees of freedom.
00:14:28.880 | But to keep things simple, you need about the same number of examples, or
00:14:33.240 | maybe say 100 times more or
00:14:34.640 | 10 times more, as the number of really free parameters.
00:14:38.240 | So now the relationship between the number of regions that I can represent and
00:14:47.680 | the number of examples I need is quite nice because the number of regions is
00:14:52.560 | going to be two to the number of these binary features.
00:14:57.040 | So a person could wear glasses or not, be a female or a male, a child or not, and
00:15:01.240 | I could have 100 of these things.
00:15:03.120 | And I could probably recognize reasonably well all of these 2 to
00:15:08.000 | the 100 configurations of people,
00:15:10.440 | even though I've obviously not seen all of those 2 to the 100 configurations.
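The arithmetic behind this comparison can be sketched as follows. The value of k (parameters per detector) is a hypothetical number chosen only for illustration; the talk gives n = 100 but no k.

```python
n_features = 100               # binary features such as "wears glasses"
k_params_per_feature = 1_000   # hypothetical size of each detector

# Order n*k parameters, hence roughly order n*k examples.
n_parameters = n_features * k_params_per_feature
# Every on/off combination of the features is a distinct configuration.
n_configurations = 2 ** n_features

print(f"parameters (and, roughly, examples needed): {n_parameters:,}")
print(f"distinguishable configurations: {n_configurations:.3e}")
```

The first number grows linearly in the number of features, the second exponentially.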
00:15:15.840 | Why is it that I'm able to do that?
00:15:18.120 | I'm able to do that because the models can learn about each of these binary
00:15:22.280 | features kind of independently in the sense that I don't need to see
00:15:26.160 | every possible configuration of the other features to know about wearing glasses.
00:15:32.640 | I can learn about wearing glasses even though I've never seen
00:15:37.680 | somebody who was a female and a child and chubby and had yellow shoes.
00:15:45.400 | If I have seen enough examples of people wearing glasses,
00:15:49.920 | I can learn about wearing glasses in general.
00:15:52.600 | I don't need to see all of the configurations of the other features to
00:15:55.200 | learn about one feature.
00:15:56.160 | Okay?
00:15:58.520 | And so this is really why this thing works,
00:16:03.600 | is because we're making assumptions about the data
00:16:08.240 | that those features are meaningful by themselves.
00:16:11.400 | And you don't need to actually have data for each of the regions,
00:16:15.960 | the exponential number of regions, in order to learn
00:16:20.520 | the proper way of detecting or of discovering these intermediate features.
00:16:26.240 | Let me add something here.
00:16:30.760 | There were some experiments recently actually showing that
00:16:34.960 | this kind of thing is really happening.
00:16:37.920 | Because the features I was talking about, not only I'm assuming that they exist,
00:16:46.040 | but the optimization methods, the training procedures discover them.
00:16:50.880 | They can learn them.
00:16:51.640 | And this is an experiment that's been done in Antonio Torralba's lab at MIT,
00:17:00.800 | where they trained a usual ConvNet to recognize places.
00:17:06.880 | So the outputs of the net are just the types of places,
00:17:10.880 | like is this a beach scene or an office scene or a street scene and so on?
00:17:15.120 | But then the thing they've done is they asked people to analyze the hidden units
00:17:20.480 | to try to figure out what each hidden unit was doing.
00:17:22.600 | And they found that there's a large proportion of units that humans can find
00:17:27.040 | a pretty obvious interpretation for what those units like.
00:17:29.800 | So they see a bunch of units which like people or
00:17:37.080 | different kinds of people or animals or buildings or seatings or tables,
00:17:41.320 | lighting and so on.
00:17:42.600 | So it's as if those neural nets are indeed discovering semantic features.
00:17:48.720 | They're semantic because people can actually give them names, and they serve as intermediate
00:17:53.440 | features in order to reach the final goal, here classifying scenes.
00:17:57.640 | And the reason they're generalizing is because now you can combine those
00:18:03.320 | features in an exponentially large number of ways.
00:18:05.560 | You could have a scene that has a table, a different kind of lighting,
00:18:10.960 | some people, maybe a pet.
00:18:13.960 | And you can say something meaningful about the combinations of these things.
00:18:19.160 | Because the network is able to learn all of these features without having to
00:18:24.040 | see all of the possible configurations of them.
00:18:26.160 | So I don't know if my explanation makes sense to you, but
00:18:30.480 | now's your chance to ask me a question.
00:18:32.520 | All clear?
00:18:37.840 | Usually it's not.
00:18:40.480 | Yeah?
00:18:41.480 | >> So with decision trees you can kind of do the same as well, right?
00:18:46.520 | >> With decision trees?
00:18:47.720 | >> If you have a set of decision trees.
00:18:51.000 | >> Right, to some extent.
00:18:52.840 | So the question is, can't we do the same thing with a set of decision trees?
00:18:57.640 | Yeah, in fact, this is one of the reasons why forests work better or
00:19:02.520 | bagged trees work better than single trees.
00:19:04.920 | Forests, or bagged trees, are actually one level deeper than a single tree.
00:19:11.280 | But they still don't have as much of a sort of distributed aspect as neural nets.
00:19:20.640 | And usually they're not trained jointly.
00:19:23.640 | I mean, boosted trees are, to some extent, in a greedy way.
00:19:27.920 | But yeah, any other question?
00:19:31.240 | Yeah?
00:19:32.180 | >> Do you find cases that are non-compositional?
00:19:35.580 | >> Cases where what?
00:19:37.220 | >> Do you find non-compositional cases?
00:19:40.060 | >> Non-conditional?
00:19:41.260 | >> Non-compositional.
00:19:42.740 | >> Non-computer vision.
00:19:43.580 | >> Non-compositional.
00:19:45.940 | >> Non-compositional.
00:19:47.700 | I don't understand the question.
00:19:48.700 | I mean, I don't understand what you mean.
00:19:49.980 | What do you mean non-compositional?
00:19:51.100 | >> You're talking about compositionality here.
00:19:53.260 | >> Yeah, it's everywhere around us.
00:19:55.780 | I don't think that there are examples of neural nets that really work well where
00:19:59.420 | the data doesn't have some kind of compositional structure in it.
00:20:02.140 | But if you come up with an example, I'd like to hear about it.
00:20:05.380 | Yes, yes?
00:20:09.380 | >> So in the language of graphical models,
00:20:12.340 | do you mean that we're facing a model of this rock?
00:20:16.780 | And in the real world, we're trying to look for some independent,
00:20:21.380 | but we cannot get independent, but it starts somewhere with a very small square.
00:20:26.020 | >> To think about this issue in graphical model terms can be done.
00:20:35.500 | But you have to think about not feature detection,
00:20:39.540 | like I've been doing here, but about generating an image or something like that.
00:20:45.500 | Then it's easier to think about it.
00:20:46.620 | So the same kinds of things happen if you think about how I could generate an image.
00:20:52.740 | If you think about underlying factors like which objects, where they are,
00:20:56.820 | what's their identity, what's their size, these are all independent factors,
00:21:01.340 | which you compose together in funny ways.
00:21:04.900 | If you were to do a graphics engine, you can see exactly what those ways are.
00:21:07.580 | And it's much, much easier to represent that joint distribution
00:21:14.500 | using this compositional structure than if you're trying to
00:21:19.060 | work directly in pixel space, which is normally what you would do with
00:21:23.620 | a classical non-parametric method, and it wouldn't work.
00:21:27.100 | But if you look at our best deep generative models now for images, for
00:21:30.420 | example, like GANs or VAEs, they're really, we're not there yet, but
00:21:37.140 | they're amazingly better than anything that people could dream of just a few years ago
00:21:41.340 | in machine learning.
00:21:42.180 | Okay, let me move on, because I have other things to talk about.
00:21:49.540 | So this is all kind of hand wavy, but
00:21:52.660 | some people have done some math around these ideas.
00:21:57.020 | And so for example, there's one result from two years ago,
00:22:05.260 | right here, where we studied the single layer case.
00:22:10.380 | And we consider a network with rectifiers,
00:22:15.900 | and we find that the network,
00:22:20.860 | of course, computes a piecewise linear function.
00:22:26.300 | And so one way to quantify the richness of the function that it can compute,
00:22:31.340 | I was talking about regions here, but well, you can do the same thing here.
00:22:35.580 | You can count how many pieces does this network have in its input to output function.
00:22:43.020 | And it turns out that it's exponential
00:22:46.540 | in the number of inputs; well, it's the number of units to the power of the number of inputs.
00:22:54.820 | So that's for sort of distributed representation,
00:22:59.740 | there's an exponential kicking in.
00:23:02.420 | We also studied the depth aspect.
00:23:05.420 | So what you need to know about depth is that
00:23:10.980 | there's a lot of earlier theory that says that a single layer is sufficient
00:23:15.140 | to represent any function.
00:23:17.100 | However, that theory doesn't specify how many units you might need.
00:23:20.140 | And in fact, you might need an exponentially large number of units.
00:23:24.860 | So what several results show is that there are functions
00:23:32.220 | that can be represented very efficiently with few units, so few parameters.
00:23:39.980 | If you allow the network to be deep enough.
00:23:41.980 | So out of all the functions, again, it's a luckiness thing, right?
00:23:46.580 | Out of all the functions that exist,
00:23:48.420 | there's a very, very small fraction
00:23:52.100 | which happen to be very easy to represent with a deep network.
00:23:56.260 | And if you try to represent these functions with a shallow network,
00:24:02.860 | you're screwed.
00:24:04.260 | You're gonna need an exponential number of parameters.
00:24:08.180 | And so you're gonna need an exponential number of examples to learn these things.
00:24:11.700 | But again, we're incredibly lucky that the functions we want to learn
00:24:16.740 | have this property.
00:24:17.540 | But in a sense, it's not surprising.
00:24:19.980 | I mean, we use this kind of compositionality and depth everywhere.
00:24:23.180 | When we write a computer program, we don't just have a single main.
00:24:26.860 | We have functions that call functions.
00:24:29.180 | And we were able to show similar things as what I was telling you about for
00:24:34.940 | the single layer case, that as you increase depth for
00:24:40.300 | these deep ReLU networks, the number of pieces in the piecewise
00:24:46.100 | linear function grows exponentially with the depth.
00:24:48.300 | So it's already exponentially large with a single layer, but
00:24:53.340 | it gets exponentially even more with a deeper net.
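One standard way to see piece counts grow exponentially with depth is the sawtooth construction sketched below. It is a textbook-style illustration chosen here as an assumption, not the specific networks analyzed in the results mentioned in the talk: each layer is a tent map built from two rectifier units, and stacking the layer doubles the number of linear pieces.

```python
from fractions import Fraction

def relu(x):
    """Rectifier, using exact rational arithmetic."""
    return max(x, Fraction(0))

def tent(x):
    """One layer: a tent map on [0, 1] built from two rectifier units."""
    return 2 * relu(x) - 4 * relu(x - Fraction(1, 2))

def deep_net(x, depth):
    """Compose the same two-unit layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth, grid_bits=10):
    """Count linear pieces on [0, 1] by counting slope changes on a dyadic
    grid. Valid as long as depth <= grid_bits, so every kink is a grid point."""
    n = 2 ** grid_bits
    ys = [deep_net(Fraction(i, n), depth) for i in range(n + 1)]
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)
    return changes + 1

for depth in range(1, 6):
    print(depth, "layers ->", count_linear_pieces(depth), "pieces")
```

The parameter count grows linearly with depth (two units per layer), while the piece count doubles with each layer. A single-layer rectifier network on a 1-D input would need roughly one unit per kink to match it.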
00:24:58.340 | Okay, so this was a topic of representation of functions.
00:25:03.620 | Why deep architectures can be very powerful if we're lucky,
00:25:09.860 | and we seem to be lucky.
00:25:10.820 | Another topic I wanna mention that's kind of very much in the foundations is
00:25:21.060 | how is it that we're able to train these neural nets in the first place?
00:25:25.300 | In the 90s, many people decided to not do any more research on neural nets,
00:25:31.540 | because there were theoretical results showing that there are really
00:25:35.980 | an exponentially large number of local minima in the training objective of a neural net.
00:25:43.660 | So in other words, the function we wanna learn has many of these holes,
00:25:49.820 | and if we start at a random place, well, what's the chance we're gonna find
00:25:54.340 | the best one, the one that corresponds to a good cost?
00:25:58.740 | And that was one of the motivations for people who flocked into a very large area
00:26:05.140 | of research in machine learning in the 90s and 2000s,
00:26:08.420 | based on algorithms that require only convex optimization to train.
00:26:13.340 | Cuz of course, if we can do convex optimization, we eliminate this problem.
00:26:17.460 | If the objective function is convex in the parameters,
00:26:20.140 | then we know there's a single global minimum.
00:26:24.620 | Right, so let me show you a picture here, you get a sense of,
00:26:30.380 | if you look on the right hand top, this is, if you draw a random function in 1D or
00:26:37.300 | 2D or 3D, like here is kind of a random smooth function in 2D,
00:26:41.540 | you see that it's gonna have many ups and downs.
00:26:45.420 | These are local minima.
00:26:49.620 | But the good news is that in high dimension, it's a totally different story.
00:26:56.020 | So what are the dimensions here?
00:26:57.220 | We're talking about the parameters of the model, and
00:27:00.020 | the vertical axis is the cost that we're trying to minimize.
00:27:03.740 | And what happens in high dimension is that instead of having
00:27:08.860 | a huge number of local minima on our way when we're trying to optimize,
00:27:14.820 | what we encounter instead is a huge number of saddle points.
00:27:18.620 | So saddle point is like the thing on the bottom right in 2D.
00:27:23.540 | So you have two parameters and the y-axis is the cost you wanna minimize.
00:27:27.140 | And so what you see in a saddle point is you have dimensions or
00:27:30.980 | directions where the objective function has a minimum.
00:27:37.700 | So there's like a curve that, it curves up.
00:27:42.100 | And in other directions, it curves down.
00:27:45.060 | So saddle point has both a minimum in some direction and
00:27:48.820 | a maximum in other directions.
00:27:49.860 | So this is interesting because even though it's a,
00:27:57.780 | these points, like saddle points and minima are places where you could get stuck.
00:28:05.260 | In principle, if you're exactly at the saddle point, you don't move.
00:28:08.100 | But if you move a little bit away from it, you will go down the saddle, right?
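A minimal sketch of this behavior, assuming the toy saddle f(x, y) = x² − y² and plain gradient descent (both chosen for illustration; the step size and starting points are arbitrary):

```python
def f(x, y):
    """Toy saddle: curves up along x, curves down along y."""
    return x ** 2 - y ** 2

def grad(x, y):
    """Gradient of f."""
    return 2 * x, -2 * y

def descend(x, y, lr=0.1, steps=100):
    """Plain gradient descent from (x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

at_saddle = descend(0.0, 0.0)   # exactly at the saddle: the gradient is zero
stuck = descend(1.0, 0.0)       # on the line y = 0: slides into the saddle
escaped = descend(1.0, 1e-6)    # nudged slightly off that line

print("at the saddle:", at_saddle)
print("on the line  :", stuck, f(*stuck))
print("slightly off :", escaped, f(*escaped))
```

The tiny offset in y is amplified at every step, so the third run leaves the saddle and the cost keeps decreasing. This toy function is unbounded below, which a real training objective would not be.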
00:28:11.380 | So what our work and
00:28:18.900 | other work from NYU, Choromanska and
00:28:23.700 | collaborators of Yann LeCun, showed is that actually
00:28:29.340 | in very high dimension, not only is the issue more about saddle points than local minima.
00:28:39.460 | But the local minima are good.
00:28:43.060 | So let me try to explain what I mean by this.
00:28:45.140 | So let me show you actually first an experiment from the NYU guys.
00:28:54.940 | So they did an experiment where they gradually changed the size of the neural net.
00:29:01.180 | And they look at what looks like local minima, but
00:29:04.580 | they could be saddle points that are the lowest that they could obtain by training.
00:29:09.660 | And what you're looking at is a distribution of
00:29:13.820 | errors they get from different initializations of their training.
00:29:17.860 | And so what happens is that when the network is small,
00:29:21.820 | like the pink here on the right, there's a widespread distribution of
00:29:26.540 | cost that you can get depending on where you start, and they're pretty high.
00:29:31.700 | And if you increase the size of the network, it's like all of the local
00:29:36.740 | minima that you find concentrate around a particular cost.
00:29:42.300 | So you don't get any of these bad local minima that you would get with a small
00:29:47.220 | network, they're all kind of pretty good.
00:29:50.140 | And if you increase even more the size of the network,
00:29:51.700 | this is like a single hidden layer network, not very complicated.
00:29:55.140 | This phenomenon increases even more.
00:29:58.060 | In other words, they all kind of converge to the same kind of cost.
00:30:01.940 | So let me try to explain what's going on.
00:30:04.860 | So if we go back to the picture of the saddle point, but
00:30:08.980 | instead of being in 2D, imagine you are in a million D.
00:30:12.460 | And in fact, people have billion D networks these days.
00:30:16.420 | I'm sure Andrew has even bigger ones, I'm sure.
00:30:25.340 | What happens in such a high dimensional space of parameters is that,
00:30:30.700 | if things are not really bad for you, so
00:30:35.620 | if you imagine a little bit of randomness in the way the problem is set up, and
00:30:39.300 | it seems to be the case, in order to have a true local minimum,
00:30:45.180 | you need to have the curvature going up like this in all the billion directions.
00:30:51.540 | So if there is a certain probability of this event happening,
00:30:56.700 | that this particular direction is curving up and this one is curving up,
00:31:00.020 | the probability that all of them curve up becomes exponentially small.
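The argument can be made concrete with a back-of-the-envelope sketch. It assumes each direction curves up independently with the same probability, which is a strong simplification of the randomness described in the talk; the value 0.5 is arbitrary.

```python
import math

def log10_prob_local_min(n_directions: int, p_up: float = 0.5) -> float:
    """log10 of the chance that every direction curves up, assuming each one
    does so independently with probability p_up (a simplifying assumption)."""
    return n_directions * math.log10(p_up)

for d in (2, 100, 1_000_000):
    print(f"{d:>9} directions: probability ~ 10^{log10_prob_local_min(d):.1f}")
```

The log-probability shrinks linearly with the number of directions, so the probability itself shrinks exponentially.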
00:31:04.180 | So we tested that experimentally.
00:31:09.100 | What you see in the bottom left is a curve that shows the training error
00:31:16.220 | as a function of what's called the index of the critical point,
00:31:21.660 | which is just the fraction of the directions
00:31:25.860 | which are curving down, right?
00:31:31.780 | So 0% would mean it's a local minimum.
00:31:36.700 | 100% would be it's a local maximum, and anything in between is a saddle point.
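The definition of the index given above can be written down directly. The sketch below is an illustration rather than the code used in the experiments: it assumes the Hessian eigenvalues at a critical point are already available as a list (`hessian_eigenvalues` is a hypothetical input), and it treats exactly-zero eigenvalues as not curving down.

```python
def critical_point_index(hessian_eigenvalues):
    """Fraction of directions that curve down (negative eigenvalues)."""
    if not hessian_eigenvalues:
        raise ValueError("need at least one eigenvalue")
    n_down = sum(1 for ev in hessian_eigenvalues if ev < 0)
    return n_down / len(hessian_eigenvalues)


def classify_critical_point(index):
    """Map the index to the three cases described in the talk."""
    if index == 0.0:
        return "local minimum"
    if index == 1.0:
        return "local maximum"
    return "saddle point"


if __name__ == "__main__":
    for eigenvalues in ([1.0, 2.0, 0.5], [-1.0, -3.0], [2.0, -0.1, 4.0, 1.0]):
        idx = critical_point_index(eigenvalues)
        print(eigenvalues, idx, classify_critical_point(idx))
```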
00:31:43.060 | So what we find is that as training progresses,
00:31:49.340 | we're going close to a bunch of saddle points, and
00:31:53.460 | none of them are local minima, otherwise we would be stuck.
00:31:58.060 | And in fact, we never encounter local minima until
00:32:04.380 | we reach the lowest possible cost that we're able to get.
00:32:08.500 | In addition, there is a theory suggesting that, so
00:32:14.500 | the local minima will actually be close in cost to the global minimum.
00:32:23.420 | They will be above, and
00:32:25.420 | they will concentrate in a little band above the global minimum.
00:32:28.420 | But that band of local minima will be close to the global minimum.
00:32:36.020 | And the larger the dimension, the more this is gonna be true.
00:32:39.700 | So to go back to my analogy, right?
00:32:42.420 | At some point, of course, you will get local minima,
00:32:45.660 | even though it's unlikely when you're in the middle.
00:32:48.660 | When you get close to the bottom, well, you can't go lower.
00:32:51.380 | So it has to rise up in all the directions.
00:32:53.940 | But it's, yeah.
00:32:56.100 | So that's kind of good news.
00:32:58.500 | I think, in spite of this,
00:33:00.340 | I don't think that the optimization problem of neural nets is solved.
00:33:03.660 | There are still many cases where we find ourselves to be stuck.
00:33:07.500 | And we still don't understand what the landscape looks like.
00:33:10.900 | There's a set of beautiful experiments by Ian Goodfellow
00:33:13.660 | that help us visualize a bit what's going on.
00:33:16.540 | But I think one of the open problems of optimization for
00:33:18.940 | neural nets is, what does the landscape actually look like?
00:33:23.300 | It's hard to visualize, of course, because it's very high dimensional.
00:33:26.060 | But for example, we don't know what those saddle points really look like.
00:33:32.420 | When we actually measure the gradient near those,
00:33:37.580 | when we are approaching those saddle points, it's not close to zero.
00:33:40.420 | So we never go to actually flat places.
00:33:42.940 | This may be due to the fact that we're using SGD and
00:33:45.580 | it's kind of hovering above things.
00:33:47.340 | There might be conditioning issues where even if you are at a saddle,
00:33:51.220 | near a saddle point, you might be stuck, even though it's not a local minimum.
00:33:54.420 | Because in many directions, it's still going up,
00:33:58.380 | maybe 95% of the directions.
00:34:01.580 | And the other directions are hard to reach because simply,
00:34:06.500 | there's a lot more curvature in some directions than other directions.
00:34:09.460 | And that's the traditional ill conditioning problem.
00:34:13.700 | We don't know exactly what's making it hard to train some networks.
00:34:18.300 | Usually, conv nets are pretty easy to train.
00:34:20.940 | But when you go into things like machine translation or
00:34:23.780 | even worse, reasoning tasks with things like neural Turing machines and
00:34:27.900 | things like that, it gets really, really hard to train these things.
00:34:30.580 | And people have to use all kinds of tricks like curriculum learning,
00:34:33.700 | which are essentially optimization tricks, to make the optimization easier.
00:34:37.940 | So I don't want to tell you that, the optimization problem of neural nets is
00:34:42.980 | easy, it's done, we don't need to worry about it.
00:34:45.340 | But it's much easier and
00:34:47.020 | less of a concern than what people thought in the 90s.
00:34:52.820 | Okay, so.
00:34:56.740 | So machine learning, I mean, deep learning is moving out of pattern recognition and
00:35:05.500 | into more complicated tasks, for example, including reasoning and
00:35:09.900 | combining deep learning with reinforcement learning, planning, and
00:35:13.300 | things like that.
00:35:13.860 | You've heard about attention.
00:35:16.540 | That's one of the tools that is really, really useful for many of these tasks.
00:35:22.980 | We've sort of come up with attention
00:35:28.580 | mechanisms as not a way to focus on what's going on in the outside world.
00:35:33.820 | Like we usually think of attention like attention in the visual space, but
00:35:37.380 | internal attention, right?
00:35:38.900 | In the space of representations that have been built.
00:35:41.460 | So that's what we do here in machine translation.
00:35:45.020 | And it's been extremely successful, as Quoc said.
00:35:49.420 | So I'm not gonna show you any of these pictures, blah, blah, blah.
00:35:55.060 | So I'm getting more now into the domain of challenges.
00:36:00.620 | A challenge that I've been working on since I was a baby researcher as a PhD
00:36:05.020 | student is long-term dependencies and recurrent nets.
00:36:10.740 | And although we've made a lot of progress,
00:36:13.500 | this is still something that we haven't completely cracked.
00:36:18.020 | And it's connected to the optimization problem that I told you before, but
00:36:21.980 | it's a very particular kind of optimization problem.
00:36:26.180 | So some of the ideas that we've used to try to make
00:36:31.660 | the propagation of information and gradients easier include
00:36:37.100 | using skip connections over time, include using multiple time scales.
00:36:42.660 | There's some recent work in this direction from my lab and other groups.
00:36:46.220 | And even the attention mechanism itself,
00:36:50.420 | you can think of as a way to help deal with long-term dependencies.
00:36:56.180 | So the way to see this is to think of
00:37:02.780 | the place on which we're putting attention as part of the state.
00:37:06.060 | So imagine really you have a recurrent net and it has two kinds of state.
00:37:12.500 | It has the usual recurrent net state, but it has the content of the memory.
00:37:17.620 | Quoc told you about memory nets and neural Turing machines.
00:37:20.340 | And the full state really includes all of these things.
00:37:24.140 | And now we're able to read or write from that memory.
00:37:29.660 | I mean, the little recurrent net is able to do that.
00:37:32.060 | So what happens is that there are memory elements
00:37:38.460 | which don't change over time, maybe they've been written once.
00:37:43.740 | And so the information that has been stored there can stay for
00:37:48.180 | as long as it's not overwritten.
00:37:51.700 | So that means that if you consider the gradients back
00:37:57.060 | propagated through those cells, they can go pretty much unhampered and
00:38:01.980 | there's no vanishing gradient problem.
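A toy calculation makes the contrast concrete. This is a simplified sketch, not the model from the talk: it assumes a scalar state whose per-step derivative is a fixed number (0.9 is an arbitrary illustrative value), and a memory cell that is copied unchanged so its per-step derivative is exactly 1.

```python
def backprop_scale(step_derivatives):
    """Multiply per-step derivatives, as the chain rule does through time."""
    scale = 1.0
    for d in step_derivatives:
        scale *= d
    return scale


n_steps = 100

# Ordinary recurrent state: each step shrinks the gradient a little
# (0.9 is an assumed value chosen only for illustration).
recurrent_scale = backprop_scale([0.9] * n_steps)

# Memory cell written once and then left alone: each step is an identity copy.
memory_scale = backprop_scale([1.0] * n_steps)

if __name__ == "__main__":
    print("through recurrent state:", recurrent_scale)
    print("through untouched memory:", memory_scale)
```

The gradient through the untouched cell keeps its full magnitude after 100 steps, while the one through the ordinary state has shrunk by several orders of magnitude.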
00:38:03.420 | So this is something that could be, that view of the problem of long term
00:38:09.380 | dependencies with memory I think could be very useful.
00:38:13.700 | All right, in the last part of my presentation, I wanna tell you about
00:38:17.780 | what I think is the biggest challenge ahead of us, which is unsupervised learning.
00:38:21.860 | Any question about attention and memory before I move on to unsupervised learning?
00:38:27.620 | Okay, so why do we care about unsupervised learning?
00:38:48.860 | At least not in an obvious way.
00:38:51.540 | There are less obvious ways where unsupervised learning is actually already
00:38:55.780 | extremely successful.
00:38:56.860 | So for example, when you train word embeddings with Word2Vec or
00:38:59.700 | any other model and you use that to pre-train,
00:39:02.500 | like we did our machine translation systems or other kinds of NLP tasks.
00:39:06.780 | You're exploiting unsupervised learning.
00:39:08.900 | Even when you train a language model that you're gonna stick in some other thing or
00:39:15.460 | pre-train something with that, you're also doing unsupervised learning.
00:39:20.100 | But I think the potential of and
00:39:26.900 | the importance of unsupervised learning is usually underrated.
00:39:30.860 | So why do we care?
00:39:36.580 | First of all, the idea of unsupervised learning is that we can train,
00:39:39.580 | we can learn something from large quantities of unlabeled data that humans
00:39:43.500 | have not curated, and we have lots of that.
00:39:46.940 | Humans are very good at learning from unlabeled data.
00:39:53.500 | I have an example that I use often that makes it very, very clear that,
00:40:02.540 | for example, children can learn all kinds of things about the world,
00:40:05.940 | even though no one, no adult ever tells them
00:40:11.220 | anything about it until much later when it's too late.
00:40:17.420 | So a two or three year old understands physics.
00:40:22.300 | If she has a ball, she knows what's gonna happen when she drops the ball.
00:40:28.100 | She knows how liquids behave.
00:40:30.700 | She knows all kinds of things about objects and ordinary Newtonian physics,
00:40:37.020 | even though she doesn't have explicit equations and a way to describe them
00:40:40.500 | with words, but she can predict what's gonna happen next, right?
00:40:45.300 | And the parents don't tell the children, force equals mass times acceleration.
00:40:55.620 | this is purely unsupervised, and it's very powerful.
00:40:58.820 | We don't even have that right now.
00:41:00.060 | We don't have computers that can understand the kinds of physics that
00:41:03.260 | children can understand.
00:41:04.140 | So it looks like it's a skill that humans have, and that's very important for
00:41:12.220 | humans to make sense of the world around us, but
00:41:15.900 | that we haven't really succeeded yet in putting into machines.
00:41:18.340 | Let me tell you other reasons that are connected to this,
00:41:23.940 | why unsupervised learning could be useful.
00:41:25.940 | When you do supervised learning,
00:41:28.220 | essentially the way you train your system is you focus on a particular task.
00:41:33.020 | It goes, here's the inputs, and here's the input variables, and
00:41:36.380 | here's an output variable that I would like you to predict given the input.
00:41:39.460 | You're learning P of Y given X.
00:41:40.820 | But if you're doing unsupervised learning, essentially you're learning about
00:41:45.660 | all the possible questions that could be asked about the data that you observe.
00:41:50.340 | So it's not that there's X1, X2, X3, and Y.
00:41:54.260 | Everything is an X, and
00:41:55.460 | you can predict any of the X given any of the other X, right?
00:41:58.860 | If I give you a picture and I hide a part of it, you can guess what's missing.
00:42:01.980 | If I hide the caption, you can generate the caption given the image.
00:42:09.540 | If I hide the image and I give you the caption, you can guess what the image
00:42:14.260 | would be or draw it or figure out from examples which one is the most appropriate.
00:42:18.700 | So you can answer any questions about the data
00:42:22.100 | when you have captured the joint distribution between them, essentially.
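The idea that a joint distribution answers questions in any direction can be sketched with a tiny table. Everything in this example is made up for illustration: the two variables (an image label and a caption), their values, and the probabilities are hypothetical, and a real model would represent the joint implicitly rather than as a dictionary.

```python
from collections import defaultdict

# Hypothetical joint distribution P(image, caption); the numbers sum to 1.
joint = {
    ("cat_photo", "a cat"): 0.40,
    ("cat_photo", "a dog"): 0.10,
    ("dog_photo", "a cat"): 0.05,
    ("dog_photo", "a dog"): 0.45,
}


def conditional(joint_table, given_index, given_value):
    """Return P(other variable | variable at given_index == given_value)."""
    other_index = 1 - given_index
    totals = defaultdict(float)
    for pair, p in joint_table.items():
        if pair[given_index] == given_value:
            totals[pair[other_index]] += p
    norm = sum(totals.values())
    if norm == 0.0:
        raise ValueError("conditioning value has zero probability")
    return {value: p / norm for value, p in totals.items()}


if __name__ == "__main__":
    # Hide the caption, predict it from the image.
    print(conditional(joint, 0, "cat_photo"))
    # Hide the image, predict it from the caption.
    print(conditional(joint, 1, "a dog"))
```

The same table serves both queries, whereas a supervised model of P(Y | X) would only answer one of them.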
00:42:25.620 | So that could be useful.
00:42:28.340 | Another practical thing that unsupervised learning has been used for,
00:42:35.820 | in fact, this is how the whole deep learning thing started,
00:42:38.100 | is that it could be used as a regularizer.
00:42:41.420 | Because in addition to telling our model that we want to predict Y given X,
00:42:50.920 | we're saying find representations of X that both predict Y and
00:42:58.180 | somehow capture something about the distribution of X,
00:43:01.780 | the leading factors, the explanatory factors of X.
00:43:06.140 | And this, again, is making an assumption about the data, so
00:43:09.060 | we can use that as a regularizer if the assumption is valid.
00:43:11.700 | Essentially, the assumption is that the factor Y that we're trying to predict
00:43:18.460 | is one of the factors that explain X.
00:43:21.360 | And that by doing unsupervised learning to discover factors that explain X,
00:43:26.000 | we're gonna pick Y among the other factors.
00:43:29.320 | And so it's gonna be much easier now to do supervised learning.
00:43:32.120 | Of course, this is also the reason why transfer learning works,
00:43:37.600 | because there are underlying factors that explain the inputs for a bunch of tasks.
00:43:44.600 | And maybe one subset of factors is relevant for one task, and
00:43:48.980 | another subset of factors is relevant for another task.
00:43:51.340 | But if these factors overlap,
00:43:53.460 | then there's a potential for synergy by doing multitask learning.
00:43:59.020 | So the reason multitask learning is working is the same reason unsupervised learning
00:44:02.940 | is working: there are representations and
00:44:08.300 | factors that explain the data that can be useful for
00:44:11.620 | our supervised learning tasks of interest.
00:44:15.120 | That also could be used for domain adaptation for the same reason.
00:44:18.360 | The other thing that people don't talk about as much about unsupervised learning,
00:44:26.960 | and I think it was part of the initial success that we had with stacking auto
00:44:30.760 | encoders and RBMs, is that you can actually make the optimization problem of
00:44:35.880 | training deep nets easier.
00:44:38.760 | Cuz if you're gonna, for the most part,
00:44:43.820 | if you're gonna train a bunch of RBMs or a bunch of auto encoders, and
00:44:48.260 | I'm not saying this is the right way of doing it, but
00:44:50.540 | it captures some of the spirit of what unsupervised learning does.
00:44:54.020 | A lot of the learning can be done locally.
00:44:55.700 | You're trying to extract some information,
00:44:57.580 | you're trying to discover some dependencies, that's a local thing.
00:45:00.340 | Once we have a slightly better representation, we can again tweak it to
00:45:03.540 | extract better, more independence, or something like that.
00:45:06.460 | So there's a sense in which the optimization problem
00:45:09.680 | might be easier if you have a very deep net.
00:45:11.480 | Another reason why we should care about unsupervised learning,
00:45:16.400 | even if our ultimate goal is to do supervised learning,
00:45:19.520 | is because sometimes the output variables are complicated.
00:45:24.480 | They are compositional.
00:45:26.640 | They have a joint distribution.
00:45:28.520 | So in machine translation, which we talked about, the output is a sentence.
00:45:33.080 | A sentence is a set of, is a tuple of words that have a complicated joint
00:45:36.740 | distribution given the input in the other language.
00:45:39.260 | And so it turns out that many of the things we discover by exploring
00:45:44.060 | unsupervised learning, which is essentially about capturing joint distributions,
00:45:48.180 | can be often used to deal with these structured output problems where
00:45:53.820 | you have many outputs that form a compositional, complicated distribution.
00:46:01.500 | There's another reason why unsupervised learning, I think,
00:46:04.000 | is going to be really necessary for AI.
00:46:05.960 | Model-based reinforcement learning.
00:46:11.120 | So I think I have another slide just for this.
00:46:16.520 | Let's think about self-driving cars.
00:46:23.760 | This is a very popular topic these days.
00:46:29.080 | How did I learn that I shouldn't do some things with the wheel
00:46:33.720 | that will kill myself when I'm driving?
00:46:36.360 | Because I haven't experienced these states where I get killed.
00:46:41.800 | And I simply haven't done it like a thousand times to learn how to avoid it.
00:46:45.440 | So supervised learning, or rather traditional reinforcement learning,
00:46:53.000 | like policy learning kind of thing,
00:46:58.680 | or actor-critic or things like that, won't work because
00:47:03.560 | I need to generalize about situations that
00:47:10.440 | I'm never going to encounter, because otherwise if I did, I would die.
00:47:13.320 | So these are dangerous states that I need to generalize about,
00:47:21.200 | but I can't have enough data for them.
00:47:25.360 | And I'm sure there are lots of machine learning applications where
00:47:28.440 | we would be in that situation.
00:47:30.600 | I remember a couple of decades ago, I got some data from a nuclear plant.
00:47:36.880 | And so they wanted to predict when it's going to blow up.
00:47:40.520 | >> [LAUGH] >> To avoid it.
00:47:45.320 | So I said, how many examples?
00:47:48.200 | >> [LAUGH] >> They said zero.
00:47:54.440 | Right, so you see, sometimes it's hard to do supervised learning,
00:47:58.520 | because the data you'd like to have, you can't have.
00:48:01.360 | It's data about situations that are very rare.
00:48:04.280 | So how can we possibly solve this problem?
00:48:08.520 | Well, the only solution I can see is that we learn enough about the world
00:48:13.760 | that we can predict how things would unfold.
00:48:16.600 | When I'm driving, I have a kind of mental model of physics and
00:48:21.280 | how cars behave, so I can figure out that if I turn right at this point,
00:48:25.880 | I'm going to end up on the wall and this is going to be very bad for me.
00:48:29.200 | And I don't need to actually experience that to know that it's bad.
00:48:32.040 | I can make a mental simulation of what would happen.
00:48:36.440 | So I need a kind of generative model of how the world would unfold
00:48:41.560 | if I do such and such actions.
00:48:43.200 | And unsupervised learning is sort of the ideal thing to do that.
00:48:48.120 | But of course, it's going to be hard because we're going to have to train
00:48:51.400 | models that capture a lot of aspects of the world in order to be able to learn
00:48:57.680 | to generalize properly in those situations even though they don't see any data of it.
00:49:02.720 | So that's one reason why I think reinforcement learning needs to be
00:49:13.000 | worked on more.
00:49:18.000 | So I have a little thing here.
00:49:20.200 | I think people who have been doing deep learning can collaborate with people
00:49:26.360 | who are doing reinforcement learning and
00:49:28.400 | not just by providing a black box that they can use in their usual algorithms.
00:49:32.960 | I think there are things that we do in supervised deep learning that, or
00:49:38.280 | unsupervised deep learning, that can be useful in sort of rethinking
00:49:43.200 | our reinforcement learning.
00:49:44.120 | So one example, well, one thing I really like to think about is credit assignment.
00:49:52.520 | In other words, how do different machine learning algorithms figure out what
00:49:56.640 | the hidden units are supposed to do, what the intermediate computations or
00:50:00.120 | the intermediate actions should be?
00:50:01.760 | This is what credit assignment is about.
00:50:03.200 | And backprop is the best recipe we currently have for doing credit assignment.
00:50:09.520 | It tells the parameters of some intermediate layer how they should change
00:50:14.200 | so that the cost much, much later, 100 steps later,
00:50:17.560 | if it's a recurrent net, should be reduced.
00:50:20.320 | So we could probably use some inspiration from back prop and
00:50:27.440 | how it's used to improve reinforcement learning.
00:50:31.960 | And one such cue is how
00:50:38.600 | when we do supervised back prop, say, we don't predict
00:50:44.680 | the expected loss that we're gonna have and then try to minimize it.
00:50:50.080 | Where the expectation would be over the different realizations of the correct
00:50:55.120 | class.
00:50:55.760 | That's not what we do.
00:50:56.880 | But this is what people do in RL.
00:50:58.160 | They will learn a critic or a Q function,
00:51:04.240 | which is learning the expected value of the future reward or future loss.
00:51:10.000 | In our case, that might be minus log probability of the correct answer given
00:51:14.000 | the input.
00:51:14.500 | And then they will back prop through this or
00:51:19.160 | use it to estimate the gradient on the actions.
00:51:23.640 | Instead, when we do supervised learning, we're gonna do credit assignment
00:51:32.520 | where we use the particular observations of the correct class that actually
00:51:36.440 | happened for this x, right?
00:51:38.760 | We have x, we have y, and
00:51:40.840 | we use the y to figure out how to change our prediction or our action.
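For a concrete instance of using the observed y for credit assignment, consider the loss mentioned above, minus log probability of the correct answer. The sketch below assumes a softmax classifier over raw scores (`logits` is a hypothetical name, and the softmax output layer is an assumption, not something stated in the talk). In that setting the gradient with respect to the scores is the predicted probabilities minus the one-hot vector of the class that actually occurred, with no learned estimate of expected loss in between.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_grad(logits, y):
    """Loss -log p(y | x) and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[y])
    # The observed class y enters directly: gradient = probs - onehot(y).
    grad = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
    return loss, grad


if __name__ == "__main__":
    loss, grad = nll_and_grad([2.0, 0.5, -1.0], y=1)
    print("loss:", loss)
    print("gradient on logits:", grad)
```

The gradient is negative only for the observed class, so a descent step raises that class's score and lowers the others.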
00:51:45.280 | So it looks like this is something that should be done for RL.
00:51:52.840 | And in fact, we have a paper on something like this for
00:51:57.160 | a sequence prediction.
00:52:00.560 | This is the kind of work which is at the intersection of dealing with structured
00:52:04.840 | outputs, reinforcement learning, and supervised learning.
00:52:07.440 | So I think there's a lot of potential benefit of
00:52:10.960 | changing the frame of thinking that people in RL have had.
00:52:17.360 | For many decades, people in RL have been not thinking about the world
00:52:22.320 | with the same eyes as people doing neural nets.
00:52:24.400 | They've been thinking about the world in terms of discrete states that
00:52:28.880 | could be enumerated and proving theorems about these algorithms that depend on
00:52:35.240 | essentially collecting enough data to fill all the possible configurations of
00:52:39.000 | the state and their corresponding effects on the reward.
00:52:44.000 | When you start thinking in terms of neural nets and
00:52:47.400 | deep learning, the way to approach problems is very, very different.
00:52:50.360 | Okay, let me continue about unsupervised learning and
00:52:55.760 | why this is so important.
00:52:58.160 | If you look at the kinds of mistakes that our current machine learning algorithms
00:53:03.240 | make, you find that our neural nets are just cheating.
00:53:08.920 | They're using the wrong cues to try to produce the answers.
00:53:14.320 | And sometimes it works, sometimes it doesn't work.
00:53:15.960 | So how can we make our models smarter, so that they make fewer mistakes?
00:53:25.440 | Well, the only solution is to make sure that those models
00:53:31.760 | really understand how the world works, at least at the level of humans,
00:53:38.120 | to get human level accuracy, human level performance.
00:53:40.440 | It may be not necessary to do this for a particular problem you're trying to solve.
00:53:46.160 | So maybe we can get away with doing speech recognition without really
00:53:50.720 | understanding the meaning of the words.
00:53:53.360 | Probably that's gonna be okay.
00:53:55.560 | But for other tasks, especially those involving language, I think having
00:54:01.320 | models that actually understand how the world ticks is gonna be very, very important.
00:54:04.640 | So how can we have machines that understand how the world works?
00:54:12.720 | Well, one of the ideas that I've been talking a lot about in the last decade
00:54:18.280 | is that of disentangling factors of variation.
00:54:22.120 | This is related to a very old idea in pattern recognition,
00:54:25.160 | computer vision, called invariance.
00:54:27.720 | The idea of invariance was that we would like to compute or
00:54:31.760 | design, initially design and now learn features, say,
00:54:35.280 | of the image that are invariant to the things we don't care about.
00:54:39.160 | Maybe we wanna do object recognition, so we don't care about position or orientation.
00:54:43.960 | So we would like to have features that are translation invariant,
00:54:46.520 | rotation invariant, scaling invariant, whatever.
00:54:49.480 | So this is what invariance is about.
00:54:50.880 | But when you're in the business of doing unsupervised learning,
00:54:53.800 | of trying to figure out how the world works,
00:54:55.880 | it's not good enough to extract invariant features.
00:54:59.160 | What we actually wanna do is to extract all of the factors that explain the data.
00:55:04.160 | So if we're doing speech recognition, we want not only to extract the phonemes,
00:55:09.280 | but we also want to figure out what kind of voice is that?
00:55:12.600 | Maybe who is it?
00:55:14.120 | What kind of recording conditions or what kind of microphone?
00:55:17.960 | Is it in a car?
00:55:18.880 | Is it outside?
00:55:20.320 | All of that information which you're trying to get rid of normally,
00:55:23.720 | you actually want to learn about, so
00:55:28.400 | that you'll be able to generalize even to new tasks, for example.
00:55:31.440 | Maybe the next day I'm not gonna ask you to recognize phonemes, but
00:55:33.880 | recognize who's speaking.
00:55:35.120 | More generally, if we're able to disentangle these factors that explain
00:55:41.160 | how the data varies, everything becomes easy.
00:55:43.920 | Especially if those factors now can be generated in an independent way
00:55:50.120 | to generate the data.
00:55:51.240 | For example, we can learn to answer a question that only depends on one or
00:56:00.360 | two factors, and basically we have eliminated all the other ones because we
00:56:02.880 | have separated them.
00:56:03.520 | So a lot of things become much easier.
00:56:06.400 | So that's one notion, right?
00:56:09.560 | We can disentangle factors.
00:56:10.760 | There's another notion, which is the notion of multiple levels of abstraction,
00:56:14.760 | which is of course at the heart of what we're trying to do with deep learning.
00:56:19.040 | And the idea is that we can have representations of the world,
00:56:25.040 | representation of the data as a description that involves factors,
00:56:32.320 | or features, and we can do that at multiple levels.
00:56:39.880 | And there are more abstract levels.
00:56:44.600 | So if I'm looking at a document, there's the level of the pixels,
00:56:49.000 | the level of the strokes, the level of the characters, the level of the words,
00:56:53.240 | and maybe the level of the meaning of individual words.
00:56:55.640 | And we actually have systems that will recognize from a scanned document
00:56:59.720 | all of these levels.
00:57:00.640 | When we go higher up, we're not sure what the right levels are, but
00:57:04.360 | clearly there must be representations of the meaning, not just of single words,
00:57:07.760 | but of sequences of words and the whole paragraph.
00:57:10.480 | What's the story?
00:57:11.960 | And why is it important to represent things in that way?
00:57:15.320 | Because higher levels of abstraction are representations
00:57:21.520 | from which it is much easier to do things, to answer questions.
00:57:26.360 | So the more semantic levels mean basically we can
00:57:31.000 | very easily act on the information when it's represented that way.
00:57:34.080 | If we think about the level of words, it's much easier to check whether a particular
00:57:38.680 | word is in the document if I have the words extracted than if I have to do it
00:57:42.080 | from the pixels.
00:57:44.160 | And if I have to answer a complicated question about the intention of the person,
00:57:49.080 | working at the level of words is not high enough, it's not abstract enough.
00:57:51.800 | I need to work at a more abstract level in which maybe the same
00:57:56.880 | notion could be represented with many different types of words,
00:58:01.080 | where many different sentences could express the same meaning, and
00:58:03.920 | I wanna be able to capture that meaning.
00:58:05.800 | So the last slide I have is something that I've been working on in
00:58:13.920 | the last couple of years, which is connected to unsupervised learning,
00:58:21.640 | but more generally to the relationship between how
00:58:27.360 | we can build intelligent machines and the intelligence of humans or animals.
00:58:33.320 | And as you may know, this was one of the key
00:58:37.840 | motivations for doing neural nets in the first place.
00:58:41.720 | The intuition is this, that we are hoping that there are a few simple
00:58:49.080 | key principles that explain what allows us to be intelligent.
00:58:55.720 | And that if we can discover these principles,
00:58:59.360 | of course, we can also build machines that are intelligent.
00:59:02.000 | That's why the neural nets were inspired by things we know from the brain in
00:59:09.080 | the first place.
00:59:10.760 | We don't know if this is true, but if it is, then it's great.
00:59:18.280 | And I mean, this would make it much easier to understand how brains work,
00:59:22.320 | as well as building AI.
00:59:23.400 | So in trying to bridge this gap, because right now,
00:59:30.600 | our best neural nets are very, very different from what's going on in brains,
00:59:34.560 | as far as we can tell by talking to neuroscientists.
00:59:39.640 | In particular, backprop, although it's
00:59:45.960 | kicking ass from a machine learning point of view, it's not clear at all how
00:59:50.480 | something like this would be implemented in brains.
00:59:52.680 | So I've been trying to explore that, and also trying to see how we could
00:59:59.480 | generalize those credit assignment principles that would come out
01:00:03.680 | in order to also do unsupervised learning.
01:00:08.880 | So we've made a little bit of progress.
01:00:13.240 | A couple of years ago, I came up with an idea called target prop,
01:00:17.440 | which is a way of generalizing backprop to
01:00:23.440 | propagating targets for each layer.
01:00:28.080 | Of course, this idea has a long history.
01:00:34.040 | More recently, we've been looking at ways to
01:00:39.800 | implement gradient estimation in deep recurrent networks
01:00:44.840 | that perform some computation that turn out to end up with
01:00:52.320 | parameter updates corresponding to gradient descent in the prediction error
01:00:56.840 | that look like something that neuroscientists have been observing and
01:01:01.840 | don't completely understand called STDP, spike timing dependent plasticity.
01:01:05.440 | So I don't really have time to go into this, but I think this whole area
01:01:12.200 | of reconnecting neuroscience with machine learning and neural nets is
01:01:18.400 | something that has been kind of forgotten by the machine learning community because
01:01:22.480 | we're all so busy building self-driving cars.
01:01:24.640 | >> [LAUGH] >> But I think over the long term,
01:01:29.680 | it's a very exciting prospect.
01:01:34.840 | Thank you very much.
01:01:50.700 | >> To begin with, great talk.
01:02:01.540 | My question is regarding the lack of overlap between the results in
01:02:06.300 | the study of complex networks, like when they study the brain networks, right?
01:02:11.340 | There are a lot of publications that talk about the emergence of hubs, and
01:02:16.180 | especially a lot of publications on the degree distribution of
01:02:19.660 | the interneuron network.
01:02:21.020 | >> Right, right.
01:02:21.940 | >> But then when you look at the degree distribution of the so-called neurons in
01:02:26.740 | deep nets, you don't get to see the emergence of the hub behavior.
01:02:30.940 | So why do you think that there's such lack of overlap between the results?
01:02:36.100 | >> Because I think the hub story is maybe not that important.
01:02:41.620 | First of all, I really think that in order to understand the brain,
01:02:47.460 | you have to understand learning in the brain.
01:02:49.660 | And if we look at our experience in machine learning and deep learning,
01:02:55.580 | although the architecture does matter, what matters even more
01:03:00.940 | is the general principles that allow us to train these things.
01:03:04.220 | So I think the study of the connectivity makes sense.
01:03:11.140 | You can't have a fully connected thing and
01:03:14.620 | adding a way to have a small number of hubs to go from anywhere to anywhere is
01:03:18.540 | a reasonable idea, but I don't think it really explains that much.
01:03:26.060 | The central question is, how does the brain learn complicated things?
01:03:30.500 | And it does it better than our current machines,
01:03:35.820 | yet we don't know even a simple way of training
01:03:42.660 | brains that at least fits the biology reasonably.
01:03:45.740 | Yeah?
01:03:48.820 | >> Are there any real-world examples where the curse of
01:03:52.900 | dimensionality is still a problem for neural nets?
01:03:55.140 | >> Yeah, anytime it doesn't work.
01:03:58.940 | >> [LAUGH] >> I mean,
01:04:01.100 | from a generalization point of view.
01:04:02.820 | So Andrew told us yesterday that we can just add more data and
01:04:11.260 | computing power and for some problems this may work.
01:04:14.580 | But sometimes the amount of data you would need is just too large
01:04:19.620 | with our current techniques.
01:04:21.700 | And we'll need also to develop, how did you call it, the Hail Mary.
01:04:29.300 | All right?
01:04:30.540 | We also need to do some research on the algorithms and
01:04:33.500 | the architectures to be able to learn about how the world is organized so
01:04:40.140 | that we can generalize in much more powerful ways.
01:04:43.220 | And that is needed because the kinds of tasks we want to solve involve many,
01:04:51.060 | many variables that have an exponentially large number of possible values.
01:04:54.820 | And that's the curse of dimensionality essentially.
01:04:57.300 | So it's facing pretty much all of the AI problems around us.
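
The counting argument behind this answer can be made concrete with a small sketch. It assumes the simplest possible setting, which is not from the talk: every variable is discrete with the same number of values, and a learner with no way to generalize must see each joint configuration at least once. The function name `coverage` and the dataset size of one billion examples are illustrative choices only.

```python
def coverage(n_variables, n_examples, values_per_variable=2):
    """Upper bound on the fraction of joint configurations a dataset can cover.

    Assumes (hypothetically) that every example is a distinct configuration,
    which is the best case for a learner that cannot generalize.
    """
    total_configurations = values_per_variable ** n_variables
    return min(1.0, n_examples / total_configurations)


# With a fixed budget of one billion examples, coverage collapses
# as the number of variables grows.
for n_variables in (10, 30, 100):
    frac = coverage(n_variables, n_examples=10**9)
    print(f"{n_variables:>3} binary variables: at most {frac:.3g} of configurations seen")
```

Under these assumptions, adding data only helps up to a few dozen variables. Beyond that, the learner has to exploit structure in the world in order to generalize to configurations it has never seen.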
01:05:02.020 | >> Hi, I have a question on multi-agent reinforcement learning.
01:05:08.380 | >> Yeah.
01:05:09.100 | >> If you assume all cars can never predict all possible potential accidents,
01:05:15.140 | what about the potential for transfer learning and things like that?
01:05:19.940 | >> Yeah, so I was giving an example of a single human learning how to drive.
01:05:25.220 | We might be able to use the millions of people using self-driving cars,
01:05:31.460 | correcting, and some of them making accidents to actually make some progress
01:05:35.220 | without actually solving the hard problems.
01:05:37.660 | And this is probably what you're going to be doing for a while.
01:05:40.020 | But, and we should do it.
01:05:42.940 | We should definitely use all the data we have.
01:05:45.140 | Currently, if you look at the amount of data we're using for
01:05:47.300 | speech recognition or language modeling,
01:05:49.460 | it's hugely more than what any human actually sees in their lifetime.
01:05:54.660 | So we're doing something wrong.
01:05:57.180 | And we could do better with less data.
01:06:01.780 | And babies and kids can do it.
01:06:11.820 | >> Well, so one thing that strikes me is that most of this training,
01:06:15.180 | like for images, it's done on static images.
01:06:18.900 | >> Well, there's quite a bit of work on video these days.
01:06:21.500 | >> Okay.
01:06:22.180 | >> It's mostly a computational bottleneck.
01:06:24.220 | >> I mean, I've seen dogs generated by GANs, right?
01:06:29.540 | >> Yeah.
01:06:30.460 | Well, keep in mind we were doing MNIST just a couple of years ago.
01:06:33.100 | >> Yeah, but if you see a dog [INAUDIBLE]
01:06:39.020 | >> Yeah, absolutely.
01:06:40.540 | Yeah, I don't think it's a fundamental issue.
01:06:44.140 | If we're able to do it well on static images,
01:06:48.500 | the same principles will allow us to do sequences.
01:06:52.140 | We're already doing sequential things.
01:06:55.260 | For example, an interesting project is speech synthesis with recurrent nets and
01:07:01.780 | stuff like that, or convolutional nets, whatever.
01:07:04.180 | So it's more like we're not sure how to train them well and
01:07:10.260 | how to discover these explanatory factors and so on.
01:07:13.980 | That's my view.
01:07:14.660 | Yeah? >> I have a question,
01:07:18.340 | maybe non-technical.
01:07:19.580 | So we have seen the human error rates versus our algorithms' error rates for
01:07:25.220 | things that we are used to, like image recognition, speech recognition.
01:07:29.260 | >> Right, right.
01:07:29.860 | >> So has there ever been an experiment where we try to train humans for
01:07:33.980 | things that we are not used to?
01:07:35.620 | >> Right. >> And not train the machine at
01:07:37.660 | the same time and see.
01:07:39.380 | >> Right. >> So how capable are algorithms?
01:07:41.540 | >> You're asking if these experiments have been done?
01:07:43.340 | >> Yeah, yeah.
01:07:45.140 | >> I don't know, but I'm sure the humans would beat the hell out of the machines,
01:07:49.340 | for now.
01:07:49.840 | For this kind of thing, humans are able to learn a new task or
01:07:55.340 | new concepts from very few examples.
01:07:58.020 | And we know that in order for machines to do as well,
01:08:01.420 | they just need more sort of common sense, right?
01:08:04.660 | More general knowledge of the world.
01:08:06.220 | This is what allows humans to learn so quickly on a few examples.
01:08:09.620 | Yeah?
01:08:12.260 | >> You presented experimental data where you showed that lots of local minima for
01:08:15.980 | these parameters, or maybe saddle points.
01:08:18.220 | >> Saddle points.
01:08:19.220 | >> Have similar performance.
01:08:20.820 | >> Yeah. >> Are these saddle points-
01:08:22.260 | >> Well, no, the local minima,
01:08:23.260 | that's the local minima, yeah.
01:08:24.700 | >> Are these local minima separated widely in parameter space, or are they close by?
01:08:29.860 | >> That's a good question, I could- >> And
01:08:31.620 | I guess a related question is, once you've trained the network,
01:08:34.060 | if there are lots of local minima, does that suggest that you could compress
01:08:38.900 | the network and represent it with far fewer parameters?
01:08:41.420 | >> Maybe.
01:08:47.100 | So for your first question, we have some experiments dating from 2009,
01:08:51.940 | where we try to visualize in 2D the trajectories of training.
01:08:58.420 | So this is a paper, first author is Dumitru Erhan, a former PhD student of mine,
01:09:03.820 | where we wanted to see how, depending on where you start,
01:09:09.260 | where do you end up?
01:09:12.100 | Do different trajectories end up in the same place?
01:09:15.660 | Or do they all go in a different place?
01:09:16.940 | Turns out they all go in a different place.
01:09:19.260 | And so the number of local minima is much larger than the number of trajectories
01:09:22.780 | that we tried, like 500 or 1,000.
01:09:26.340 | It's so much larger that no two random initial seeds end up near each other.
01:09:32.460 | So it looks like there's a huge number of local minima,
01:09:35.580 | which is in agreement with the theory that there's an exponential number of them.
01:09:39.100 | But the good news is they're all kind of equivalent in terms of cost,
01:09:42.340 | if you have a large network.
01:09:44.580 | >> Does that suggest compressibility at all, or how can you?
01:09:52.100 | >> I'm not sure.
01:09:53.380 | I'm sure there are many ways to compress these networks.
01:09:56.540 | There's a lot of redundancy in many ways.
01:09:58.460 | There are redundancies due to the numbering.
01:10:05.140 | Like you could flip all, take that unit, put it here, take that unit, put it here,
01:10:11.180 | and so on.
01:10:13.540 | But I don't think you're going to gain a lot of bits from that.
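
The renumbering redundancy mentioned in this answer can be checked with a minimal sketch. It assumes a one-hidden-layer tanh network with a scalar output, which is an illustrative choice and not a model from the talk; the names `mlp` and `permutation_bits` are hypothetical. Reordering the hidden units, together with their incoming and outgoing weights, leaves the computed function unchanged, and the number of bits that could be saved by fixing a canonical order is log2(n!) for n hidden units.

```python
import math
import random


def mlp(x, W1, b1, W2):
    """One-hidden-layer tanh network with a scalar output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(W2, hidden))


rng = random.Random(0)
n_in, n_hidden = 3, 5
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
W2 = [rng.gauss(0, 1) for _ in range(n_hidden)]

# Renumber the hidden units: move each unit's input weights, bias,
# and output weight together.
perm = list(range(n_hidden))
rng.shuffle(perm)
W1p = [W1[i] for i in perm]
b1p = [b1[i] for i in perm]
W2p = [W2[i] for i in perm]

x = [0.5, -1.0, 2.0]
y_original = mlp(x, W1, b1, W2)
y_permuted = mlp(x, W1p, b1p, W2p)
print(math.isclose(y_original, y_permuted, rel_tol=1e-9, abs_tol=1e-12))


def permutation_bits(n_units):
    """Bits saved by fixing a canonical order for n_units interchangeable units."""
    return math.log2(math.factorial(n_units))


print(permutation_bits(1000))
```

For a layer of 1,000 hidden units the saving is roughly 8,500 bits. If that layer had, say, a million 32-bit weights (an assumed size for illustration), that is a tiny fraction of the total, consistent with the remark that this symmetry does not buy much compression.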
01:10:15.420 | >> So we've talked about that one of the main advantages of deep learning is that
01:10:22.580 | it can work with lots of data.
01:10:25.260 | But you were mentioning before that we need also to
01:10:29.460 | capture the ability of humans to work with less data.
01:10:33.060 | >> Yeah, but the reason we're able to work with fewer data is because we have first
01:10:37.780 | learned from a lot of data about the general knowledge of the world.
01:10:43.260 | >> Right, so how can we adapt neural networks to
01:10:48.660 | bring us to this new few-data paradigm?
01:10:51.660 | >> We have to do a lot better at unsupervised learning, and
01:10:57.140 | of the kind that really discovers sort of explanations about the world.
01:11:01.020 | That's what I think.
01:11:01.580 | >> Okay, let's thank Yoshua again.
01:11:09.260 | >> [APPLAUSE]
