Foundations and Challenges of Deep Learning (Yoshua Bengio)
Chapters
0:00
3:34 Bypassing the curse of dimensionality: we need to build compositionality into our ML models, just as human languages exploit compositionality to give representations and meanings to complex ideas
9:56 The need for distributed representations (vs. local methods such as clustering)
12:56 Each feature can be discovered without seeing the exponentially large number of configurations of the other features (e.g., a network whose hidden units discover features such as "person wears glasses")
24:30 Exponential advantage of depth
35:15 Attention mechanisms for deep learning: an upper-level representation can choose where to focus within an input or intermediate sequence or image
36:48 Attention Mechanisms for Memory Access Enable Reasoning
56:12 Learning Multiple Levels of Abstraction
58:11 Towards Key Principles of Learning for both Machines and Brains
So I'll tell you about some very high level stuff today and 00:00:12.120 |
Some of you already know about the book that Ian Goodfellow, 00:00:23.800 |
I think you can find it on Amazon or something. 00:00:28.880 |
And the paper version, the actual shipping is gonna be in December, hopefully for NIPS. 00:00:41.920 |
You've already heard about deep learning pretty well from several people here, at least from Andrew I think. 00:00:46.720 |
But it's good to ponder a little bit some of these ingredients that seem to be 00:00:57.720 |
But in general, for machine learning to succeed at learning really complicated tasks, 00:01:02.080 |
of the kind where we want to reach human-level performance, 00:01:10.520 |
it's going to need to acquire a lot of information about the world. 00:01:17.120 |
And the big success of machine learning for AI has been to show that 00:01:22.680 |
we can provide that information through data, through examples. 00:01:29.360 |
And that machine will need to know a huge amount of information about the world around us. 00:01:33.680 |
This is not how we are doing it now, because we're not able to train such big models yet. 00:01:39.440 |
And so we'll need models that are much bigger than the ones we currently have. 00:01:42.880 |
Of course, that means machine learning algorithms that can represent 00:01:48.000 |
complicated functions, that's one good thing about neural nets. 00:01:51.360 |
But there are many other machine learning approaches that allow you in principle 00:01:55.440 |
to represent very flexible families of functions, like non-parametric methods. 00:02:10.840 |
Point 3, of course, says that you need enough computing power to train these models. 00:02:19.000 |
And point 5 just says that it's not enough to be able to train the model, 00:02:24.280 |
you have to be able to use it in a reasonably efficient way from a computational point of view. 00:02:29.840 |
This is not always the case with some probabilistic models where inference, 00:02:34.080 |
in other words, answering questions, having the computer do something, 00:02:37.920 |
can be intractable and then you need to do some approximations, 00:02:44.080 |
Now, the point I really want to talk about is the fourth one, 00:02:47.160 |
how do we defeat the curse of dimensionality? 00:02:51.440 |
In other words, if you don't assume much about the world, 00:02:58.520 |
And so I'm gonna tell you a bit about the assumptions 00:03:06.120 |
that are behind a lot of deep learning algorithms which make it possible to 00:03:10.480 |
work as well as we are seeing in practice in the last few years. 00:03:30.800 |
so how do we bypass the curse of dimensionality? 00:03:34.200 |
The curse of dimensionality is about the exponentially large number of 00:03:39.920 |
configurations of the space of variables that we want to model. 00:03:44.120 |
The number of values that all of the variables that we observe can take 00:03:48.800 |
is gonna be exponentially large in general because there's a compositional nature. 00:03:55.520 |
If each pixel can take two values and you've got a million pixels, 00:03:58.360 |
then you've got 2 to the one million possible images. 00:04:01.240 |
So the only way to beat an exponential is to use another exponential. 00:04:11.800 |
We need to build our models in such a way that they can represent an exponentially large number of configurations. 00:04:18.680 |
But yet, these models need to have a reasonably small number of parameters. 00:04:26.440 |
Reasonably small in the sense that compared to the number of 00:04:30.240 |
configurations of the variables, the number of parameters should be small. 00:04:34.080 |
And we can achieve that by composing little pieces together, 00:04:40.720 |
composing layers together, composing units on the same layer together. 00:04:45.600 |
And that's essentially what's happening with deep learning. 00:04:48.000 |
So you actually have two kinds of compositions. 00:04:50.840 |
There's the compositions happening on the same layer. 00:04:54.080 |
This is the idea of distributed representations, 00:05:00.120 |
This is what you get when you learn embeddings for words or for images, 00:05:06.400 |
And then there's the idea of having multiple levels of representation. 00:05:12.840 |
And there, there is another kind of composition that takes place, 00:05:16.560 |
whereas the first one is a kind of parallel composition. 00:05:19.720 |
I can choose the values of my different units separately, and 00:05:23.080 |
then they together represent an exponentially large number of possible configurations. 00:05:28.760 |
In the second case, there's a sequential composition where I take the output of 00:05:32.240 |
one level and I combine them in new ways to build features for the next level. 00:05:45.800 |
And the reason this works is because the world around us is better modeled by making these assumptions. 00:05:53.040 |
It's not necessarily true that deep learning is gonna work for any problem. 00:05:57.440 |
In fact, if we consider the set of all possible distributions that we would like 00:06:01.000 |
to learn from, deep learning is no better than any other approach. 00:06:06.120 |
And that's basically what the no free lunch theorem is saying. 00:06:09.880 |
It's because we are incredibly lucky that we live in this world, 00:06:14.040 |
which can be described by using composition, that these algorithms are working so well. 00:06:23.160 |
So before I go a bit more into distributed representations, 00:06:31.720 |
let me say a few words about non-distributed representations. 00:06:34.000 |
So if you're thinking about things like clustering, N-grams for 00:06:38.360 |
language modeling, classical nearest neighbors, SVMs with Gaussian kernels, 00:06:45.120 |
classical non-parametric models with local kernels, decision trees, 00:06:52.280 |
all these things, the way these algorithms really work is actually pretty 00:06:57.960 |
straightforward if you cut the crap and hide the math: they break up the data space into regions. 00:07:12.960 |
And they're gonna use different free parameters for 00:07:17.080 |
each of those regions to figure out what the right answer should be. 00:07:20.240 |
The right answer, it doesn't have to be supervised learning. 00:07:21.920 |
Even in unsupervised learning, there's a right answer. 00:07:23.600 |
It might be the density or something like that. 00:07:25.200 |
Okay, and you might think that that's the only way of solving a problem. 00:07:31.360 |
We consider all of the cases and we have an answer for each of the cases. 00:07:36.160 |
And we can maybe interpolate between those cases that we've seen. 00:07:39.160 |
The problem with this is somebody comes up with a new example which isn't 00:07:46.520 |
in between two of the examples we've seen, something that requires us to extrapolate. 00:07:51.520 |
Something that's a non-trivial generalization. 00:07:58.360 |
Then these methods have a hard time saying something meaningful away from the training examples. 00:08:02.240 |
There's another interesting thing to note here, 00:08:07.000 |
which I would like you to keep in mind before I show the next slide, 00:08:10.400 |
which is in red here, which is we can do a kind of simple counting 00:08:16.880 |
to relate the number of parameters, the number of free parameters that can be 00:08:21.480 |
learned, and the number of regions in the data space that we can distinguish. 00:08:28.440 |
So here, we basically have a linear relationship between these two things. 00:08:33.440 |
So for each region, I'm gonna need at least something like some kind of 00:08:38.960 |
center for the region, and maybe if I need to output something, 00:08:42.120 |
I'll need an extra set of parameters to tell me what the answer should be in that area. 00:08:48.040 |
So the number of parameters grows linearly with the number of regions that we can distinguish. 00:08:55.480 |
The good news is I can have any kind of function, right? 00:08:58.320 |
So I can break up the space in any way I want, and 00:09:00.520 |
then for each of those regions, I can have any kind of output that I need. 00:09:03.800 |
So for decision trees, the regions would come from splitting across axes and so on, and 00:09:10.840 |
this picture is more like nearest neighbor or something like that. 00:09:47.960 |
Okay, so here's the point of view of distributed 00:09:53.320 |
representations for solving the same general machine learning problem. 00:10:00.520 |
We have a data space and we wanna break it down, but 00:10:03.800 |
we're gonna break it down in a way that's not general. 00:10:07.680 |
We're gonna break it down in a way that makes assumptions about the data, but 00:10:12.840 |
it's gonna be compositional and it's going to allow us to be exponentially more efficient. 00:10:19.360 |
So in the picture on the right, what you see is a way to break 00:10:24.160 |
the input space by the intersection of half planes. 00:10:27.960 |
And this is the kind of thing you would have with what happens at the first layer of a neural net. 00:10:33.360 |
So here, imagine the input is two dimensional, so I can plot it here, and 00:10:37.240 |
I have three binary hidden units, C1, C2, C3. 00:10:41.960 |
So because they're binary, you can think of them as little binary classifiers. 00:10:50.880 |
you can think of what they're doing as a linear classification. 00:10:55.040 |
And so those colored hyperplanes here are the decision surfaces for each of them. 00:11:01.000 |
Now, these three bits, they can take eight values, right, 00:11:06.320 |
corresponding to whether each of them is on or off. 00:11:10.160 |
And those different configurations of those bits correspond to 00:11:14.320 |
actually seven regions here, because one of the eight configurations doesn't correspond to a feasible region. 00:11:20.680 |
So now you see that we are defining a number of regions which is corresponding 00:11:27.520 |
to all of the possible intersections of the corresponding half planes. 00:11:31.160 |
And now we can play the game of how many regions do we get for how many parameters. 00:11:39.240 |
And what we see is that if we played the game of growing the number of 00:11:43.920 |
dimensions, of features, and also of inputs, we can get an exponentially large 00:11:50.000 |
number of regions, which are all of these intersections, right? 00:11:52.600 |
There's an exponential number of these intersections 00:11:55.080 |
corresponding to different binary configurations. 00:11:59.640 |
Yet the number of parameters grows linearly with the number of units. 00:12:02.800 |
So it looks like we're able to express a very rich function with relatively few parameters. 00:12:07.040 |
And then on top of that, I could imagine you have a linear classifier, right? 00:12:13.360 |
So the number of parameters grows just linearly with the number of features. 00:12:20.880 |
But the number of regions for which the network can really provide a different answer grows exponentially. 00:12:31.000 |
And the reason it's very cool is that it allows those neural nets to generalize. 00:12:36.080 |
Because while we're learning about each of those features, 00:12:40.760 |
we can generalize to regions we've never seen, because we've learned enough about each feature separately. 00:12:50.360 |
I'm going to give you an example of this in a couple of slides. 00:12:57.560 |
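To make the region-counting concrete, here is a small sketch (my own illustration, not something from the talk) that draws random hyperplanes, uses the binary on/off pattern of the units as a region identifier, and compares the number of distinct regions to the number of parameters. Note that in a fixed 2-D input space the number of regions only grows polynomially with the number of units; the exponential gap appears when the input dimension grows along with the number of features.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                            # input dimension, as in the 2-D picture
X = rng.uniform(-5, 5, size=(200_000, d))        # dense sample of the input space

for n_units in [3, 6, 12, 24]:
    W = rng.normal(size=(n_units, d))            # one weight vector per binary unit
    b = rng.normal(size=n_units)                 # one bias per unit
    codes = (X @ W.T + b > 0)                    # which side of each hyperplane
    n_regions = len(np.unique(codes, axis=0))    # distinct codes ~ non-empty regions
    n_params = n_units * (d + 1)                 # parameters grow only linearly
    print(f"{n_units:2d} units, {n_params:3d} parameters -> about {n_regions} regions")
```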
So think about those features, let's say the input is an image of a person. 00:13:07.160 |
I have a detector that says that the person wears glasses. 00:13:11.240 |
And I have another unit that's detecting whether the person is female or male. 00:13:16.800 |
And I have another unit that detects whether the person is a child or not. 00:13:20.080 |
And you can imagine hundreds or thousands of these things, of course. 00:13:23.120 |
So the good news is you could imagine learning about each of these 00:13:34.000 |
feature detectors, these little classifiers, separately. 00:13:40.200 |
You could share intermediate layers between the input and those features. 00:13:44.520 |
But let's take even the worst case and imagine we were to train those separately, 00:13:48.840 |
which is the case in the linear model that I showed before. 00:13:51.800 |
We have a separate set of parameters for each of these detectors. 00:13:55.600 |
So if I have n features, each of them, say, needs order of k parameters. 00:14:01.680 |
Then I need order of nk parameters, and I need order of nk examples. 00:14:07.240 |
And one thing you should know from machine learning theory is that 00:14:18.880 |
if you have p free parameters, you need order of p examples to do a reasonable job of generalizing. 00:14:28.880 |
But to keep things simple, you need about the same number of examples, or 00:14:34.640 |
10 times more, as the number of really free parameters. 00:14:38.240 |
So now the relationship between the number of regions that I can represent and 00:14:47.680 |
the number of examples I need is quite nice because the number of regions is 00:14:52.560 |
going to be two to the number of features of these binary features. 00:14:57.040 |
So a person could wear glasses or not, be a female or a male, a child or not, and 00:15:03.120 |
And I could probably recognize reasonably well all of these 2 to the 100 configurations, 00:15:10.440 |
even though I've obviously not seen all of those 2 to the 100 configurations. 00:15:18.120 |
I'm able to do that because the models can learn about each of these binary 00:15:22.280 |
features kind of independently in the sense that I don't need to see 00:15:26.160 |
every possible configuration of the other features to know about wearing glasses. 00:15:32.640 |
I can learn about wearing glasses even though I've never seen 00:15:37.680 |
somebody who was a female and a child and chubby and had yellow shoes. 00:15:45.400 |
And I have seen enough examples of people wearing glasses, 00:15:49.920 |
I can learn about wearing glasses in general. 00:15:52.600 |
I don't need to see all of the configurations of the other features to learn that. 00:16:03.600 |
And the reason this works is because we're making assumptions about the data, 00:16:08.240 |
that those features are meaningful by themselves. 00:16:11.400 |
And you don't need to actually have data for each of the regions, 00:16:15.960 |
the exponential number of regions, in order to learn 00:16:20.520 |
the proper way of detecting or of discovering these intermediate features. 00:16:30.760 |
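As a back-of-the-envelope restatement of that counting argument (my own summary, not a formula from the slides), with n binary features each needing on the order of k parameters:

```latex
\[
\underbrace{O(nk)}_{\text{parameters}}
\;\Longrightarrow\;
\underbrace{O(nk)}_{\text{examples needed}}
\qquad\text{while the features jointly distinguish}\qquad
\underbrace{2^{\,n}}_{\text{configurations}} \;\gg\; nk .
\]
```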
There were some experiments recently actually showing that this is what happens. 00:16:37.920 |
The features I was talking about, not only am I assuming that they exist, 00:16:46.040 |
but the optimization methods, the training procedures, actually discover them. 00:16:51.640 |
And this is an experiment that's been done in Antonio Torralba's lab at MIT, 00:17:00.800 |
where they trained a usual ConvNet to recognize places. 00:17:06.880 |
So the outputs of the net are just the types of places, 00:17:10.880 |
like is this a beach scene or an office scene or a street scene and so on? 00:17:15.120 |
But then the thing they've done is they asked people to analyze the hidden units 00:17:20.480 |
to try to figure out what each hidden unit was doing. 00:17:22.600 |
And they found that there's a large proportion of units that humans can find 00:17:27.040 |
a pretty obvious interpretation for what those units like. 00:17:29.800 |
So they see a bunch of units which like people or 00:17:37.080 |
different kinds of people or animals or buildings or seatings or tables, 00:17:42.600 |
So it's as if those neural nets are indeed discovering semantic features. 00:17:48.720 |
They're semantic in the sense that people can actually give them names, and they arise as intermediate 00:17:53.440 |
features on the way to the final goal, which here is classifying scenes. 00:17:57.640 |
And the reason they're generalizing is because now you can combine those 00:18:03.320 |
features in an exponentially large number of ways. 00:18:05.560 |
You could have a scene that has a table, a different kind of lighting, 00:18:13.960 |
And you can say something meaningful about the combinations of these things. 00:18:19.160 |
Because the network is able to learn all of these features without having to 00:18:24.040 |
see all of the possible configurations of them. 00:18:26.160 |
So I don't know if my explanation makes sense to you, but 00:18:41.480 |
>> So with a set of decision trees you can kind of do the same thing as well, right? 00:18:52.840 |
So the question is, can't we do the same thing with a set of decision trees? 00:18:57.640 |
Yeah, in fact, this is one of the reasons why forests work better. 00:19:04.920 |
Forests, or bagged trees, are actually one level deeper than a single tree. 00:19:11.280 |
But they still don't have as much of a sort of distributed aspect as neural nets. 00:19:23.640 |
I mean, boosted trees do something like this, to some extent, in a greedy way. 00:19:32.180 |
>> Do you find cases that are non-compositional? 00:19:51.100 |
>> You're talking about compositionality here. 00:19:55.780 |
I don't think that there are examples of neural nets that really work well where 00:19:59.420 |
the data doesn't have some kind of compositional structure in it. 00:20:02.140 |
But if you come up with an example, I'd like to hear about it. 00:20:12.340 |
do you mean that we're facing a model of this rock? 00:20:16.780 |
And in the real world, we're trying to look for some independent, 00:20:21.380 |
but we cannot get independent, but it starts somewhere with a very small square. 00:20:26.020 |
>> You can think about this issue in graphical model terms. 00:20:35.500 |
But you have to think about not feature detection, 00:20:39.540 |
like I've been doing here, but about generating an image or something like that. 00:20:46.620 |
So the same kinds of things happen if you think about how I could generate an image. 00:20:52.740 |
If you think about underlying factors like which objects, where they are, 00:20:56.820 |
what's their identity, what's their size, these are all independent factors that can combine in many ways. 00:21:04.900 |
If you were to write a graphics engine, you could see exactly what those ways are. 00:21:07.580 |
And it's much, much easier to represent that joint distribution 00:21:14.500 |
using this compositional structure than if you're trying to 00:21:19.060 |
work directly in pixel space, which is normally what you would do with 00:21:23.620 |
a classical non-parametric method, and it wouldn't work. 00:21:27.100 |
But if you look at our best deep generative models now for images, for 00:21:30.420 |
example, like GANs or VAEs, we're not there yet, but 00:21:37.140 |
they're amazingly better than anything that people could have dreamed of just a few years ago. 00:21:42.180 |
Okay, let me move on, because I have other things to talk about. 00:21:52.660 |
Some people have done some math around these ideas. 00:21:57.020 |
And so for example, there's one result from two years ago, 00:22:05.260 |
right here, where we studied the single layer case. 00:22:20.860 |
A network with ReLU units, of course, computes a piecewise linear function. 00:22:26.300 |
And so one way to quantify the richness of the function that it can compute, 00:22:31.340 |
I was talking about regions here, but well, you can do the same thing here. 00:22:35.580 |
You can count how many pieces this network has in its input-to-output function. 00:22:46.540 |
And that number grows exponentially in the number of inputs; well, it's the number of units to the power of the number of inputs. 00:22:54.820 |
So that's for the sort of distributed representation aspect. 00:23:10.980 |
Now, regarding depth, there's a lot of earlier theory that says that a single layer is sufficient to represent any function. 00:23:17.100 |
However, that theory doesn't specify how many units you might need. 00:23:20.140 |
And in fact, you might need an exponentially large number of units. 00:23:24.860 |
So what several results show is that there are functions 00:23:32.220 |
that can be represented very efficiently with few units, so few parameters. 00:23:41.980 |
So out of all the functions, again, it's a luckiness thing, right? There's a family of functions 00:23:52.100 |
which happen to be very easy to represent with a deep network. 00:23:56.260 |
And if you try to represent these functions with a shallow network, 00:24:04.260 |
you're gonna need an exponential number of parameters. 00:24:08.180 |
And so you're gonna need an exponential number of examples to learn these things. 00:24:11.700 |
But again, we're incredibly lucky that the functions we want to learn have this compositional structure. 00:24:19.980 |
I mean, we use this kind of compositionality and depth everywhere. 00:24:23.180 |
When we write a computer program, we don't just have a single main function. 00:24:29.180 |
And we were able to show similar things as what I was telling you about for 00:24:34.940 |
the single layer case, that as you increase depth for 00:24:40.300 |
these deep ReLU networks, the number of pieces in the piecewise 00:24:46.100 |
linear function grows exponentially with the depth. 00:24:48.300 |
So it's already exponentially large with a single layer, but 00:24:53.340 |
it gets exponentially even more with a deeper net. 00:24:58.340 |
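A concrete way to see this kind of separation is the classic "sawtooth" construction used in depth-separation results (a sketch in the spirit of Telgarsky-style arguments, not the exact networks analyzed in the papers mentioned here): composing a two-ReLU "hat" function k times gives 2^k linear pieces with only 2k hidden units, whereas a single hidden layer with m ReLU units on a scalar input can produce at most m + 1 pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    # two ReLU units implementing the tent map on [0, 1]:
    # hat(x) = 2x on [0, 0.5] and 2(1 - x) on [0.5, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_pieces(y, x):
    # count maximal intervals on which the sampled function is linear
    slope = np.diff(y) / np.diff(x)
    return int(np.sum(~np.isclose(slope[1:], slope[:-1]))) + 1

# grid chosen so that all breakpoints (multiples of 2**-depth) land exactly on grid points
x = np.linspace(0.0, 1.0, 2**14 + 1)
y = x
for depth in range(1, 11):
    y = hat(y)                                   # one more 2-unit layer
    print(f"depth {depth:2d} ({2 * depth:2d} ReLU units): "
          f"{count_pieces(y, x):5d} pieces; "
          f"a shallow net with the same units gets at most {2 * depth + 1}")
```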
Okay, so this was the topic of representation of functions: 00:25:03.620 |
why deep architectures can be very powerful, if we're lucky. 00:25:10.820 |
Another topic I wanna mention that's kind of very much in the foundations is 00:25:21.060 |
how is it that we're able to train these neural nets in the first place? 00:25:25.300 |
In the 90s, many people decided to not do any more research on neural nets, 00:25:31.540 |
because there were theoretical results showing that there are really 00:25:35.980 |
an exponentially large number of local minima in the training objective of a neural net. 00:25:43.660 |
So in other words, the objective function we wanna optimize has many of these local minima, 00:25:49.820 |
and if we start at a random place, well, what's the chance we're gonna find 00:25:54.340 |
the best one, the one that corresponds to a good cost? 00:25:58.740 |
And that was one of the motivations for people to flock into a very large area 00:26:05.140 |
of research in machine learning in the 90s and 2000s, 00:26:08.420 |
based on algorithms that require only convex optimization to train. 00:26:13.340 |
Cuz of course, if we can do convex optimization, we eliminate this problem. 00:26:17.460 |
If the objective function is convex in the parameters, 00:26:20.140 |
then we know there's a single global minimum. 00:26:24.620 |
Right, so let me show you a picture here so you get a sense of it. 00:26:30.380 |
If you look at the top right: if you draw a random function in 1D or 00:26:37.300 |
2D or 3D, like this kind of random smooth function in 2D, 00:26:41.540 |
you see that it's gonna have many ups and downs. 00:26:49.620 |
But the good news is that in high dimension, it's a totally different story. 00:26:57.220 |
We're talking about the parameters of the model, and 00:27:00.020 |
the vertical axis is the cost that we're trying to minimize. 00:27:03.740 |
And what happens in high dimension is that instead of having 00:27:08.860 |
a huge number of local minima on our way when we're trying to optimize, 00:27:14.820 |
what we encounter instead is a huge number of saddle points. 00:27:18.620 |
So saddle point is like the thing on the bottom right in 2D. 00:27:23.540 |
So you have two parameters and the y-axis is the cost you wanna minimize. 00:27:27.140 |
And so what you see in a saddle point is that you have dimensions or 00:27:30.980 |
directions where the objective function has a minimum. 00:27:45.060 |
So a saddle point has both a minimum in some directions and 00:27:49.860 |
a maximum in others. So this is interesting, because 00:27:57.780 |
these points, saddle points and minima, are places where you could get stuck. 00:28:05.260 |
In principle, if you're exactly at the saddle point, you don't move. 00:28:08.100 |
But if you move a little bit away from it, you will go down the saddle, right? 00:28:23.700 |
What some collaborators of Yann LeCun showed is that actually, 00:28:29.340 |
in very high dimension, not only is the issue more about saddle points than local minima, but also the local minima you find tend to be good ones. 00:28:43.060 |
So let me try to explain what I mean by this. 00:28:45.140 |
So let me show you actually first an experiment from the NYU guys. 00:28:54.940 |
So they did an experiment where they gradually changed the size of the neural net. 00:29:01.180 |
And they look at what looks like local minima, but 00:29:04.580 |
they could be saddle points that are the lowest that they could obtain by training. 00:29:09.660 |
And what you're looking at is a distribution of 00:29:13.820 |
errors they get from different initializations of their training. 00:29:17.860 |
And so what happens is that when the network is small, 00:29:21.820 |
like the pink here on the right, there's a wide spread in the distribution of 00:29:26.540 |
costs that you can get depending on where you start, and they're pretty high. 00:29:31.700 |
And if you increase the size of the network, it's like all of the local 00:29:36.740 |
minima that you find concentrate around a particular cost. 00:29:42.300 |
So you don't get any of these bad local minima that you would get with a smaller network. 00:29:50.140 |
And if you increase the size of the network even more, the distribution concentrates even further. 00:29:51.700 |
this is like a single hidden layer network, not very complicated. 00:29:58.060 |
In other words, they all kind of converge to the same kind of cost. 00:30:04.860 |
So if we go back to the picture of the saddle point, but 00:30:08.980 |
instead of being in 2D, imagine you are in a million D. 00:30:12.460 |
And in fact, people have billion D networks these days. 00:30:16.420 |
I'm sure Andrew has even bigger ones. 00:30:25.340 |
What happens in this very high-dimensional space of parameters is that, 00:30:35.620 |
if you imagine a little bit of randomness in the way the problem is set up, and 00:30:39.300 |
it seems to be the case, in order to have a true local minimum, 00:30:45.180 |
you need to have the curvature going up like this in all the billion directions. 00:30:51.540 |
So if there is a certain probability of this event happening, 00:30:56.700 |
that this particular direction is curving up and this one is curving up, 00:31:00.020 |
the probability that all of them curve up becomes exponentially small. 00:31:09.100 |
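Here is a loose numerical illustration of that argument (my own toy experiment with random symmetric matrices, not actual neural-net Hessians): draw a random "Hessian" at a critical point and check how often every direction curves up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 2000
for d in [2, 3, 5, 10, 20]:
    n_minima = 0
    for _ in range(n_trials):
        A = rng.normal(size=(d, d))
        H = (A + A.T) / 2.0                      # random symmetric matrix as a stand-in Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):    # all curvatures positive -> a true minimum
            n_minima += 1
    print(f"d = {d:2d}: fraction of random critical points that are minima = {n_minima / n_trials:.4f}")
```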
What you see in the bottom left is a curve that shows the training error 00:31:16.220 |
as a function of what's called the index of the critical point, 00:31:36.700 |
which is the fraction of directions in which the curvature goes down: 0% would mean it's a local minimum, 100% would be a local maximum, and anything in between is a saddle point. 00:31:43.060 |
So what we find is that as training progresses, 00:31:49.340 |
we're going close to a bunch of saddle points, and 00:31:53.460 |
none of them are local minima, otherwise we would be stuck. 00:31:58.060 |
And in fact, we never encounter local minima until 00:32:04.380 |
we reach the lowest possible cost that we're able to get. 00:32:08.500 |
In addition, there is a theory suggesting that 00:32:14.500 |
the local minima will actually be close in cost to the global minimum. 00:32:25.420 |
They will concentrate in a little band above the global minimum, 00:32:28.420 |
but that band of local minima will be close to the global minimum. 00:32:36.020 |
And the larger the dimension, the more this is gonna be true. 00:32:42.420 |
At some point, of course, you will get local minima, 00:32:45.660 |
even though it's unlikely when you're in the middle. 00:32:48.660 |
When you get close to the bottom, well, you can't go lower. 00:33:00.340 |
I don't think that the optimization problem of neural nets is solved. 00:33:03.660 |
There are still many cases where we find ourselves to be stuck. 00:33:07.500 |
And we still don't understand what the landscape looks like. 00:33:10.900 |
There's a set of beautiful experiments by Ian Goodfellow 00:33:13.660 |
that help us visualize a bit what's going on. 00:33:16.540 |
But I think one of the open problems of optimization for 00:33:18.940 |
neural nets is, what does the landscape actually look like? 00:33:23.300 |
It's hard to visualize, of course, because it's very high dimensional. 00:33:26.060 |
But for example, we don't know what those saddle points really look like. 00:33:32.420 |
When we actually measure the gradient near those, 00:33:37.580 |
when we are approaching those saddle points, it's not close to zero. 00:33:42.940 |
This may be due to the fact that we're using SGD rather than the full gradient. 00:33:47.340 |
There might be conditioning issues where even if you are at a saddle, 00:33:51.220 |
near a saddle point, you might be stuck, even though it's not a local minimum. 00:33:54.420 |
Because in many directions, it's still going up, 00:34:01.580 |
and the other directions are hard to reach simply because 00:34:06.500 |
there's a lot more curvature in some directions than other directions. 00:34:09.460 |
And that's the traditional ill conditioning problem. 00:34:13.700 |
We don't know exactly what's making it hard to train some networks. 00:34:20.940 |
But when you go into things like machine translation or 00:34:23.780 |
even worse, reasoning tasks with things like neural Turing machines and 00:34:27.900 |
things like that, it gets really, really hard to train these things. 00:34:30.580 |
And people have to use all kinds of tricks like curriculum learning, 00:34:33.700 |
which are essentially optimization tricks, to make the optimization easier. 00:34:37.940 |
So I don't want to tell you that the optimization problem of neural nets is 00:34:42.980 |
easy, that it's done, that we don't need to worry about it. 00:34:47.020 |
But it seems to be less of a concern than what people thought in the 90s. 00:34:56.740 |
So machine learning, I mean, deep learning is moving out of pattern recognition and 00:35:05.500 |
into more complicated tasks, for example, including reasoning and 00:35:09.900 |
combining deep learning with reinforcement learning, planning, and so on. 00:35:16.540 |
Attention is one of the tools that is really, really useful for many of these tasks. 00:35:28.580 |
You can think of attention mechanisms here not as a way to focus on what's going on in the outside world. 00:35:33.820 |
We usually think of attention as attention in the visual space, but 00:35:38.900 |
here it's attention in the space of representations that have been built. 00:35:41.460 |
So that's what we do here in machine translation. 00:35:45.020 |
And it's been extremely successful, as Quoc said. 00:35:49.420 |
So I'm not gonna show you any of these pictures, blah, blah, blah. 00:35:55.060 |
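For readers who have not seen the mechanism, here is a minimal numpy sketch of soft, content-based attention over a sequence of representations (dot-product scoring for brevity; Bahdanau-style attention scores positions with a small learned network, and real systems learn all of the projections involved):

```python
import numpy as np

def soft_attention(query, keys, values):
    # query: (d,), keys: (T, d), values: (T, d_v)
    scores = keys @ query                        # one relevance score per position
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    context = weights @ values                   # convex combination of the values
    return context, weights

rng = np.random.default_rng(0)
T, d = 6, 4
H = rng.normal(size=(T, d))                      # lower-level sequence of representations
q = H[2] + 0.1 * rng.normal(size=d)              # a query that resembles position 2
context, w = soft_attention(q, H, H)
print("attention weights:", np.round(w, 3))      # the weight typically concentrates on position 2
```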
So I'm getting more now into the domain of challenges. 00:36:00.620 |
A challenge that I've been working on since I was a baby researcher as a PhD 00:36:05.020 |
student is long-term dependencies in recurrent nets. 00:36:13.500 |
And this is still something that we haven't completely cracked. 00:36:18.020 |
And it's connected to the optimization problem that I told you before, but 00:36:21.980 |
it's a very particular kind of optimization problem. 00:36:26.180 |
So some of the ideas that we've used to try to make 00:36:31.660 |
the propagation of information and gradients easier include 00:36:37.100 |
using skip connections over time, include using multiple time scales. 00:36:42.660 |
There's some recent work in this direction from my lab and other groups. 00:36:50.420 |
And you can think of memory with attention as a way to help deal with long-term dependencies, 00:37:02.780 |
if you consider the memory and the place on which we're putting attention as part of the state. 00:37:06.060 |
So imagine really you have a recurrent net and it has two kinds of state. 00:37:12.500 |
It has the usual recurrent net state, but it has the content of the memory. 00:37:17.620 |
Quoc told you about memory networks and neural Turing machines. 00:37:20.340 |
And the full state really includes all of these things. 00:37:24.140 |
And now we're able to read or write from that memory. 00:37:29.660 |
I mean, the little recurrent net is able to do that. 00:37:32.060 |
So what happens is that there are memory elements 00:37:38.460 |
which don't change over time, maybe they've been written once. 00:37:43.740 |
And so the information that has been stored there can stay for 00:37:48.180 |
as long as it's not overwritten. 00:37:51.700 |
So that means that if you consider the gradients back 00:37:57.060 |
propagated through those cells, they can go pretty much unhampered across long time spans. 00:38:03.420 |
So that view of the problem of long-term 00:38:09.380 |
dependencies in terms of memory, I think, could be very useful. 00:38:13.700 |
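Here is a toy sketch of that view (my own illustration, loosely in the spirit of memory networks and neural Turing machines rather than the exact models discussed): a recurrent controller attends over memory slots, and a slot that was written once and is never overwritten keeps its content across many time steps, which is what lets information, and the gradients flowing through it, survive over long horizons.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_slots, d = 8, 5
M = np.zeros((n_slots, d))                       # external memory
M[3] = rng.normal(size=d)                        # written once, at "time 0"

state = rng.normal(size=d)                       # controller (recurrent net) state
for t in range(100):                             # many steps later...
    read_weights = softmax(M @ state)            # content-based addressing
    read_vector = read_weights @ M               # soft read from memory
    state = np.tanh(0.5 * state + 0.5 * read_vector)  # toy state update, no writes

print("slot 3 is unchanged after 100 steps:", np.round(M[3], 3))
print("final read weights:", np.round(read_weights, 2))
```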
All right, in the last part of my presentation, I wanna tell you about 00:38:17.780 |
what I think is the biggest challenge ahead of us, which is unsupervised learning. 00:38:21.860 |
Any question about attention and memory before I move on to unsupervised learning? 00:38:27.620 |
Okay, so why do we care about unsupervised learning? 00:38:43.940 |
Actually, it's working a lot better than it was, but 00:38:46.220 |
it's still not something you find in industrial products. 00:38:51.540 |
There are less obvious ways in which unsupervised learning is actually already being used. 00:38:56.860 |
So for example, when you train word embeddings with Word2Vec or 00:38:59.700 |
any other model and you use that to pre-train, 00:39:02.500 |
like we did for our machine translation systems or other kinds of NLP tasks. 00:39:08.900 |
Even when you train a language model that you're gonna stick in some other thing or 00:39:15.460 |
pre-train something with that, you're also doing unsupervised learning. 00:39:26.900 |
But I think the importance of unsupervised learning is usually underrated. 00:39:36.580 |
First of all, the idea of unsupervised learning is that we can 00:39:39.580 |
learn something from large quantities of unlabeled data that humans haven't annotated. 00:39:46.940 |
Humans are very good at learning from unlabeled data. 00:39:53.500 |
I have an example that I use often that makes it very, very clear that, 00:40:02.540 |
for example, children can learn all kinds of things about the world 00:40:11.220 |
without anybody telling them anything about it until much later, when it's too late. 00:40:17.420 |
So a two or three year old understands physics. 00:40:22.300 |
If she has a ball, she knows what's gonna happen when she drops the ball. 00:40:30.700 |
She knows all kinds of things about objects and ordinary Newtonian physics, 00:40:37.020 |
even though she doesn't have explicit equations and a way to describe them 00:40:40.500 |
with words, but she can predict what's gonna happen next, right? 00:40:45.300 |
And the parents don't tell the children, force equals mass times acceleration. 00:40:55.620 |
So this is purely unsupervised, and it's very powerful. 00:41:00.060 |
We don't have computers that can understand the kinds of physics that a two- or three-year-old understands. 00:41:04.140 |
So it looks like it's a skill that humans have, and that's very important for 00:41:12.220 |
humans to make sense of the world around us, but 00:41:15.900 |
we haven't really yet succeeded to put in machines. 00:41:18.340 |
Let me tell you other reasons that are connected to this, 00:41:28.220 |
In supervised learning, essentially the way you train your system is you focus on a particular task. 00:41:33.020 |
It goes: here are the input variables, and 00:41:36.380 |
here's an output variable that I would like you to predict given the input. 00:41:40.820 |
But if you're doing unsupervised learning, essentially you're learning about 00:41:45.660 |
all the possible questions that could be asked about the data that you observe. 00:41:55.460 |
You can predict any of the X's given any of the other X's, right? 00:41:58.860 |
If I give you a picture and I hide a part of it, you can guess what's missing. 00:42:01.980 |
If I hide the caption, you can generate the caption given the image. 00:42:09.540 |
If I hide the image and I give you the caption, you can guess what the image 00:42:14.260 |
would be or draw it or figure out from examples which one is the most appropriate. 00:42:18.700 |
So you can answer any questions about the data 00:42:22.100 |
when you have captured the joint distribution between them, essentially. 00:42:28.340 |
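As a tiny concrete version of that statement (my own toy example with just two binary variables): once you have the joint distribution, every "hide one part, predict the rest" question is a conditional you get by normalizing.

```python
import numpy as np

# joint distribution p[x1, x2] over two binary variables (sums to 1)
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])

p_x2_given_x1 = p / p.sum(axis=1, keepdims=True)   # hide x2, predict it from x1
p_x1_given_x2 = p / p.sum(axis=0, keepdims=True)   # hide x1, predict it from x2

print("p(x2 | x1):\n", p_x2_given_x1)
print("p(x1 | x2):\n", p_x1_given_x2)
```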
Another practical thing that unsupervised learning has been used for, 00:42:35.820 |
and in fact this is how the whole deep learning thing started, is unsupervised pre-training as a regularizer. 00:42:41.420 |
Because in addition to telling our model that we want to predict Y given X, 00:42:50.920 |
we're saying find representations of X that both predict Y and 00:42:58.180 |
somehow capture something about the distribution of X, 00:43:01.780 |
the leading factors, the explanatory factors of X. 00:43:06.140 |
And this, again, is making an assumption about the data, so 00:43:09.060 |
we can use that as a regularizer if the assumption is valid. 00:43:11.700 |
Essentially, the assumption is that the factor Y that we're trying to predict is one of the factors that explain X. 00:43:21.360 |
And that by doing unsupervised learning to discover the factors that explain X, we'll also discover factors that are relevant for predicting Y. 00:43:29.320 |
And so it's gonna be much easier now to do supervised learning. 00:43:32.120 |
Of course, this is also the reason why transfer learning works, 00:43:37.600 |
because there are underlying factors that explain the inputs for a bunch of tasks. 00:43:44.600 |
And maybe one subset of factors is relevant for one task, and 00:43:48.980 |
another subset of factors is relevant for another task. 00:43:53.460 |
then there's a potential for synergy by doing multitask learning. 00:43:59.020 |
So the reason multitask learning is working, just like the reason unsupervised learning 00:44:02.940 |
is working, is because there are representations and 00:44:08.300 |
factors that explain the data that can be useful for many different tasks. 00:44:15.120 |
That also could be used for domain adaptation for the same reason. 00:44:18.360 |
The other thing that people don't talk about as much regarding unsupervised learning, 00:44:26.960 |
and I think it was part of the initial success that we had with stacking auto 00:44:30.760 |
encoders and RBMs, is that you can actually make the optimization problem of training deep networks easier. 00:44:43.820 |
If you're gonna train a bunch of RBMs or a bunch of auto encoders, and 00:44:48.260 |
I'm not saying this is the right way of doing it, but 00:44:50.540 |
it captures some of the spirit of what unsupervised learning does. 00:44:57.580 |
When each layer is trying to discover some dependencies, that's a local thing. 00:45:00.340 |
Once we have a slightly better representation, we can again tweak it to 00:45:03.540 |
extract better, more independence, or something like that. 00:45:06.460 |
So there's a sense in which the optimization problem becomes easier. 00:45:11.480 |
Another reason why we should care about unsupervised learning, 00:45:16.400 |
even if our ultimate goal is to do supervised learning, 00:45:19.520 |
is because sometimes the output variables are complicated. 00:45:28.520 |
So in machine translation, which we talked about, the output is a sentence. 00:45:33.080 |
A sentence is a tuple of words that have a complicated joint 00:45:36.740 |
distribution given the input in the other language. 00:45:39.260 |
And so it turns out that many of the things we discover by exploring 00:45:44.060 |
unsupervised learning, which is essentially about capturing joint distributions, 00:45:48.180 |
can be often used to deal with these structured output problems where 00:45:53.820 |
you have many outputs that form a compositional, complicated distribution. 00:46:01.500 |
There's another reason why unsupervised learning, I think, is going to be essential. 00:46:11.120 |
So I think I have another slide just for this. 00:46:29.080 |
How did I learn that I shouldn't do some things with the wheel when I'm driving? 00:46:36.360 |
Because I haven't experienced these states where I get killed. 00:46:41.800 |
And I simply haven't done it like a thousand times to learn how to avoid it. 00:46:45.440 |
So supervised learning, or rather traditional reinforcement learning, 00:46:58.680 |
or actor-critic methods or things like that, won't work, because there are states 00:47:10.440 |
I'm never going to encounter, because otherwise, if I did, I would die. 00:47:13.320 |
So these are like dangerous states that I need to be able to generalize about without ever visiting them. 00:47:25.360 |
And I'm sure there are lots of machine learning applications where this matters. 00:47:30.600 |
I remember a couple of decades ago, I got some data from a nuclear plant. 00:47:36.880 |
And so they wanted to predict when it's going to blow up. 00:47:54.440 |
Right, so you see, sometimes it's hard to do supervised learning, 00:47:58.520 |
because the data you'd like to have, you can't have. 00:48:01.360 |
It's data about situations that are very rare. 00:48:08.520 |
Well, the only solution I can see is that we learn enough about the world to anticipate these dangerous situations. 00:48:16.600 |
When I'm driving, I have a kind of mental model of physics and 00:48:21.280 |
how cars behave, so that I can figure out that if I turn right at this point, 00:48:25.880 |
I'm going to end up in the wall and this is going to be very bad for me. 00:48:29.200 |
And I don't need to actually experience that to know that it's bad. 00:48:32.040 |
I can make a mental simulation of what would happen. 00:48:36.440 |
So I need a kind of generative model of how the world would unfold if I took certain actions. 00:48:43.200 |
And unsupervised learning is sort of the ideal thing to do that. 00:48:48.120 |
But of course, it's going to be hard because we're going to have to train 00:48:51.400 |
models that capture a lot of aspects of the world in order to be able to learn 00:48:57.680 |
to generalize properly in those situations even though they never see any data from them. 00:49:02.720 |
So that's one reason why I think reinforcement learning needs to be combined with unsupervised learning. 00:49:20.200 |
I think people who have been doing deep learning can collaborate with people doing reinforcement learning, 00:49:28.400 |
not just by providing a black box that they can use in their usual algorithms. 00:49:32.960 |
I think there are things that we do in supervised deep learning that, or 00:49:38.280 |
unsupervised deep learning, that can be useful in sort of rethinking how we do reinforcement learning. 00:49:44.120 |
So one example, well, one thing I really like to think about is credit assignment. 00:49:52.520 |
In other words, how do different machine learning algorithms figure out what 00:49:56.640 |
the hidden units are supposed to do, what the intermediate computations or representations should be. 00:50:03.200 |
And backprop is the best recipe we currently have for doing credit assignment. 00:50:09.520 |
It tells the parameters of some intermediate layer how they should change 00:50:14.200 |
so that the cost much, much later, 100 steps later, will be reduced. 00:50:20.320 |
So we could probably use some inspiration from back prop and 00:50:27.440 |
how it's used to improve reinforcement learning. 00:50:38.600 |
For example, when we do supervised backprop, say, we don't predict 00:50:44.680 |
the expected loss that we're gonna have and then try to minimize it. 00:50:50.080 |
where the expectation would be over the different realizations of the correct answer. 00:51:04.240 |
That's what a critic does in reinforcement learning, which is learning the expected value of the future reward or future loss. 00:51:10.000 |
In our case, that might be minus the log probability of the correct answer given the input, 00:51:19.160 |
and then you would use it to estimate the gradient on the actions. 00:51:23.640 |
Instead, when we do supervised learning, we're gonna do credit assignment 00:51:32.520 |
where we use the particular observation of the correct class that actually occurred: 00:51:40.840 |
we use the y to figure out how to change our prediction or our action. 00:51:45.280 |
So it looks like this is something that should be done for RL. 00:51:52.840 |
And in fact, we have a paper on something like this for 00:52:00.560 |
This is the kind of work which is at the intersection of dealing with structured 00:52:04.840 |
outputs, reinforcement learning, and supervised learning. 00:52:07.440 |
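To make the contrast concrete, here is a small numerical sketch (my own toy comparison, not the paper's method): for a softmax output over ten classes, the supervised gradient of minus log probability uses the observed label directly, whereas a score-function (REINFORCE-style) estimator only sees a scalar reward for a sampled class. The two point in the same direction (they differ by a factor of p(y) because they correspond to slightly different objectives), but the reward-based estimate is far noisier, which is the credit-assignment gap being described.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
logits = rng.normal(size=K)
p = np.exp(logits - logits.max()); p /= p.sum()
y = 3                                            # observed correct class
onehot_y = np.eye(K)[y]

# exact gradient of -log p(y) w.r.t. the logits: uses the actual y
supervised_grad = p - onehot_y

# REINFORCE-style estimate of the gradient of the expected cost E[c(a)],
# with c(a) = -1 if the sampled class a equals y and 0 otherwise
samples = []
for _ in range(5000):
    a = rng.choice(K, p=p)
    cost = -1.0 if a == y else 0.0
    grad_log_p_a = np.eye(K)[a] - p              # d/dlogits of log p(a)
    samples.append(cost * grad_log_p_a)
samples = np.array(samples)

print("supervised gradient:             ", np.round(supervised_grad, 3))
print("score-function mean / p(y):      ", np.round(samples.mean(axis=0) / p[y], 3))
print("score-function per-coord std dev:", np.round(samples.std(axis=0), 3))
```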
So I think there's a lot of potential benefit of 00:52:10.960 |
changing the frame of thinking that people in RL have had. 00:52:17.360 |
For many decades, people in RL have not been thinking about the world 00:52:22.320 |
with the same eyes as people doing neural nets. 00:52:24.400 |
They've been thinking about the world in terms of discrete states that 00:52:28.880 |
could be enumerated and proving theorems about these algorithms that depend on 00:52:35.240 |
essentially collecting enough data to fill all the possible configurations of 00:52:39.000 |
the state and their corresponding effects on the reward. 00:52:44.000 |
When you start thinking in terms of neural nets and 00:52:47.400 |
deep learning, the way to approach problems is very, very different. 00:52:50.360 |
Okay, let me continue about unsupervised learning. 00:52:58.160 |
If you look at the kinds of mistakes that our current machine learning algorithms 00:53:03.240 |
make, you find that our neural nets are just cheating. 00:53:08.920 |
They're using the wrong cues to try to produce the answers. 00:53:14.320 |
And sometimes it works, sometimes it doesn't work. 00:53:15.960 |
So how can we make our models smarter, make fewer mistakes? 00:53:25.440 |
Well, the only solution is to make sure that those models 00:53:31.760 |
really understand how the world works, at least at the level of humans, 00:53:38.120 |
to get human level accuracy, human level performance. 00:53:40.440 |
It may not be necessary to do this for a particular problem you're trying to solve. 00:53:46.160 |
So maybe we can get away with doing speech recognition without really understanding what is being said. 00:53:55.560 |
But for other tasks, especially those involving language, I think having 00:54:01.320 |
models that actually understand how the world ticks is gonna be very, very important. 00:54:04.640 |
So how can we have machines that understand how the world works? 00:54:12.720 |
Well, one of the ideas that I've been talking a lot about in the last decade 00:54:18.280 |
is that of disentangling factors of variation. 00:54:22.120 |
This is related to a very old idea in pattern recognition, the idea of invariance. 00:54:27.720 |
The idea of invariance was that we would like to compute or 00:54:31.760 |
design, initially design and now learn features, say, 00:54:35.280 |
of the image that are invariant to the things we don't care about. 00:54:39.160 |
Maybe we wanna do object recognition, so we don't care about position or orientation. 00:54:43.960 |
So we would like to have features that are translation invariant, 00:54:46.520 |
rotation invariant, scaling invariant, whatever. 00:54:50.880 |
But when you're in the business of doing unsupervised learning, 00:54:55.880 |
it's not good enough to extract invariant features. 00:54:59.160 |
What we actually wanna do is to extract all of the factors that explain the data. 00:55:04.160 |
So if we're doing speech recognition, we want not only to extract the phonemes, 00:55:09.280 |
but we also want to figure out what kind of voice is that? 00:55:14.120 |
What kind of recording conditions or what kind of microphone? 00:55:20.320 |
All of that information, which you're normally trying to get rid of, you now want to keep, so 00:55:28.400 |
that you'll be able to generalize even to new tasks, for example. 00:55:31.440 |
Maybe the next day I'm not gonna ask you to recognize phonemes, but to recognize who was speaking. 00:55:35.120 |
More generally, if we're able to disentangle these factors that explain 00:55:41.160 |
how the data varies, everything becomes easy. 00:55:43.920 |
Especially if those factors now can be generated in an independent way, 00:55:51.240 |
For example, we can learn to answer a question that only depends on one or 00:56:00.360 |
two factors, and basically we have eliminated all the other ones because we have disentangled them. 00:56:10.760 |
There's another notion, which is the notion of multiple levels of abstraction, 00:56:14.760 |
which is of course at the heart of what we're trying to do with deep learning. 00:56:19.040 |
And the idea is that we can have representations of the world, 00:56:25.040 |
representation of the data as a description that involves factors, 00:56:32.320 |
or features, and we can do that at multiple levels. 00:56:44.600 |
So if I'm looking at a document, there's the level of the pixels, 00:56:49.000 |
the level of the strokes, the level of the characters, the level of the words, 00:56:53.240 |
and maybe the level of the meaning of individual words. 00:56:55.640 |
And we actually have systems that will recognize the characters and the words from a scanned document. 00:57:00.640 |
When we go higher up, we're not sure what the right levels are, but 00:57:04.360 |
clearly there must be representations of the meaning, not just of single words, 00:57:07.760 |
but of sequences of words and the whole paragraph. 00:57:11.960 |
And why is it important to represent things in that way? 00:57:15.320 |
Because higher levels of abstraction are representations 00:57:21.520 |
from which it is much easier to do things, to answer questions. 00:57:26.360 |
So a more semantic level basically means we can 00:57:31.000 |
very easily act on the information when it's represented that way. 00:57:34.080 |
If we think about the level of words, it's much easier to check whether a particular 00:57:38.680 |
word is in the document if I have the words extracted than if I have to do it from the pixels. 00:57:44.160 |
And if I have to answer a complicated question about the intention of the person, 00:57:49.080 |
working at the level of words is not high enough, it's not abstract enough. 00:57:51.800 |
I need to work at a more abstract level in which maybe the same 00:57:56.880 |
notion could be represented with many different types of words, 00:58:01.080 |
where many different sentences could express the same meaning, and so on. 00:58:05.800 |
So the last slide I have is something that I've been working on in 00:58:13.920 |
the last couple of years, which is connected to unsupervised learning, 00:58:21.640 |
but more generally to the relationship between how 00:58:27.360 |
we can build intelligent machines and the intelligence of humans or animals. 00:58:37.840 |
This was one of the original motivations for doing neural nets in the first place. 00:58:41.720 |
The intuition is this, that we are hoping that there are a few simple 00:58:49.080 |
key principles that explain what allows us to be intelligent. 00:58:55.720 |
And that if we can discover these principles, 00:58:59.360 |
of course, we can also build machines that are intelligent. 00:59:02.000 |
That's why the neural nets were inspired by things we know from the brain in the first place. 00:59:10.760 |
We don't know if this is true, but if it is, then it's great. 00:59:18.280 |
And I mean, this would make it much easier to understand how brains work, 00:59:23.400 |
So in trying to bridge this gap, because right now, 00:59:30.600 |
our best neural nets are very, very different from what's going on in brains, 00:59:34.560 |
as far as we can tell by talking to neuroscientists. 00:59:45.960 |
Even though backprop is kicking ass from a machine learning point of view, it's not clear at all how 00:59:50.480 |
something like this would be implemented in brains. 00:59:52.680 |
So I've been trying to explore that, and also trying to see how we could 00:59:59.480 |
generalize those credit assignment principles that would come out 01:00:13.240 |
A couple of years ago, I came up with an idea called target prop, 01:00:39.800 |
and more recently on how to implement gradient estimation in deep recurrent networks 01:00:44.840 |
that perform some computation, and it turns out you end up with 01:00:52.320 |
parameter updates corresponding to gradient descent on the prediction error 01:00:56.840 |
that look like something that neuroscientists have been observing and 01:01:01.840 |
don't completely understand, called STDP, spike-timing-dependent plasticity. 01:01:05.440 |
So I don't really have time to go into this, but I think this whole area 01:01:12.200 |
of reconnecting neuroscience with machine learning and neural nets is 01:01:18.400 |
something that has been kind of forgotten by the machine learning community because 01:01:22.480 |
we're all so busy building self-driving cars. 01:01:24.640 |
>> [LAUGH] >> But I think over the long term, this is going to be very important. 01:02:01.540 |
My question is regarding the lack of overlap between the results in 01:02:06.300 |
the study of complex networks, like when they study the brain networks, right? 01:02:11.340 |
There are a lot of publications that talk about the emergence of hubs, and 01:02:16.180 |
especially a lot of publications on the degree distribution of those networks. 01:02:21.940 |
>> But then when you look at the degree distribution of the so-called neurons in 01:02:26.740 |
deep nets, you don't get to see the emergence of the hub behavior. 01:02:30.940 |
So why do you think that there's such lack of overlap between the results? 01:02:36.100 |
>> Because I think the hub story is maybe not that important. 01:02:41.620 |
First of all, I really think that in order to understand the brain, 01:02:47.460 |
you have to understand learning in the brain. 01:02:49.660 |
And if we look at our experience in machine learning and deep learning, 01:02:55.580 |
although the architecture does matter, what matters even more 01:03:00.940 |
is the general principles that allow us to train these things. 01:03:04.220 |
So I think the study of the connectivity makes sense. 01:03:14.620 |
And adding a way to have a small number of hubs to go from anywhere to anywhere is 01:03:18.540 |
a reasonable idea, but I don't think it really explains that much. 01:03:26.060 |
The central question is, how does the brain learn complicated things? 01:03:30.500 |
And it does it better than our current machines, 01:03:35.820 |
yet we don't even know of a simple way of training 01:03:42.660 |
brains that at least fits the biology reasonably. 01:03:48.820 |
>> Are there any cases of real-world examples where the curse of 01:03:52.900 |
dimensionality is still a problem for neural nets? 01:04:02.820 |
So Andrew told us yesterday that we can just add more data and 01:04:11.260 |
computing power and for some problems this may work. 01:04:14.580 |
But sometimes the amount of data you would need is just too large 01:04:21.700 |
And we'll need also to develop, how did you call it, the Hail Mary. 01:04:30.540 |
We also need to do some research on the algorithms and 01:04:33.500 |
the architectures to be able to learn about how the world is organized so 01:04:40.140 |
that we can generalize in much more powerful ways. 01:04:43.220 |
And that is needed because the kind of task we want to solve involve many, 01:04:51.060 |
many variables that have an exponentially large number of possible values. 01:04:54.820 |
And that's the curse of dimensionality essentially. 01:04:57.300 |
So it's facing pretty much all of the AI problems around us. 01:05:02.020 |
>> Hi, I have a question on multi-agent reinforcement learning. 01:05:09.100 |
>> If you assume all cars can never predict all possible potential accidents, 01:05:15.140 |
what about the potential for transfer learning and things like that? 01:05:19.940 |
>> Yeah, so I was giving an example of a single human learning how to drive. 01:05:25.220 |
We might be able to use the millions of people using self-driving cars, 01:05:31.460 |
correcting them, and some of them having accidents, to actually make some progress on this. 01:05:37.660 |
And this is probably what you're going to be doing for a while. 01:05:42.940 |
We should definitely use all the data we have. 01:05:45.140 |
Currently, if you look at the amount of data we're using to train these systems, 01:05:49.460 |
it's hugely more than what any human actually sees in their lifetime. 01:06:11.820 |
>> Well, so one thing that strikes me is that most of this training is on static images. 01:06:18.900 |
>> Well, there's quite a bit of work on video these days. 01:06:24.220 |
>> I mean, I've seen dogs generated by GANs, right? 01:06:30.460 |
Well, keep in mind we were doing MNIST just a couple of years ago. 01:06:40.540 |
Yeah, I don't think it's a fundamental issue. 01:06:44.140 |
If we're able to do it well on static images, 01:06:48.500 |
the same principles will allow us to do sequences. 01:06:55.260 |
For example, an interesting project is speech synthesis with recurrent nets and 01:07:01.780 |
stuff like that, or convolutional nets, whatever. 01:07:04.180 |
So it's more like we're not sure how to train them well and 01:07:10.260 |
how to discover these explanatory factors and so on. 01:07:19.580 |
>> So we have seen the human error rates versus our algorithms' error rates for 01:07:25.220 |
things that we are used to, like image recognition, speech recognition. 01:07:29.860 |
>> So has there ever been an experiment where we try to train humans for a task that is new to them, and compare? 01:07:41.540 |
>> You're asking if these experiments have been done? 01:07:45.140 |
>> I don't know, but I'm sure the humans would beat the hell out of the machines, 01:07:49.840 |
For this kind of thing, humans are able to learn a new task or concept from just a few examples. 01:07:58.020 |
And we know that in order for machines to do as well, 01:08:01.420 |
they just need more sort of common sense, right? 01:08:06.220 |
This is what allows humans to learn so quickly on a few examples. 01:08:12.260 |
>> You presented experimental data where you showed that there are lots of local minima for these networks. 01:08:24.700 |
>> Are these local minima separated widely in parameter space, or are they close by? 01:08:31.620 |
I guess a related question is, once you've trained the network, 01:08:34.060 |
if there are lots of local minima, does that suggest that you could compress 01:08:38.900 |
the network and represent it with far fewer parameters? 01:08:47.100 |
So for your first question, we have some experiments dating from 2009, 01:08:51.940 |
where we try to visualize in 2D the trajectories of training. 01:08:58.420 |
So this is a paper whose first author is Dumitru Erhan, a former PhD student with me, 01:09:03.820 |
where we wanted to see, depending on where you start, where you end up. 01:09:12.100 |
Do different trajectories end up in the same place? 01:09:19.260 |
And so the number of local minima is much larger than the number of trajectories we ran. 01:09:26.340 |
It's so much larger that no two random initializations end up near each other. 01:09:32.460 |
So it looks like there's a huge number of local minima, 01:09:35.580 |
which is in agreement with the theory that there's an exponential number of them. 01:09:39.100 |
But the good news is they're all kind of equivalent in terms of cost, 01:09:44.580 |
>> Does that suggest any compressibility at all, or how can you compress it? 01:09:53.380 |
I'm sure there are many ways to compress these networks. 01:10:05.140 |
Like you could permute the units: take that unit, put it here, take that unit, put it there. 01:10:13.540 |
But I don't think you're going to gain a lot of bits from that. 01:10:15.420 |
>> So we've talked about how one of the main advantages of deep learning is that it can exploit lots of data. 01:10:25.260 |
But you were mentioning before that we need also to 01:10:29.460 |
capture the ability of humans to work with less data. 01:10:33.060 |
>> Yeah, but the reason we're able to work with less data is because we have first 01:10:37.780 |
learned general knowledge about the world from a lot of data. 01:10:43.260 |
>> Right, so how can we adapt neural networks to learn from fewer examples? 01:10:51.660 |
>> We have to do a lot better at unsupervised learning, 01:10:57.140 |
unsupervised learning of the kind that really discovers explanations about the world. 01:11:18.300 |
>> First, an announcement: you might remember that yesterday, 01:11:23.820 |
Carl invited all the women here for an informal dinner. 01:11:28.220 |
It's going to be right outside right now after we close. 01:11:33.340 |
So before we close, actually, I'd like to thank all the speakers today and yesterday. 01:11:38.500 |
I think everybody appreciated their talks, so thanks again, all of you.